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Preface 


Optimization is the process of finding the best solution to a problem subject to a 
set of constraints. It has long been a cornerstone of both engineering and math- 
ematics. The evolution of optimization can be traced back to the ancient Greeks, 
who employed geometric methods to solve optimization problems. In the eighteenth 
and nineteenth centuries, optimization began to take on a more formalized approach 
with the development of calculus and the rise of industrialization. Mathematicians 
such as Leonhard Euler and Joseph-Louis Lagrange developed methods for finding 
the maximum and minimum values of functions, which were crucial for optimizing 
industrial processes and designing efficient machines. The twentieth century saw 
a significant expansion in the field of optimization, with the development of linear 
programming and other optimization techniques. Linear programming is a mathemat- 
ical technique for optimizing a linear objective function subject to linear constraints, 
which involves optimizing a linear objective function subject to linear constraints, 
was first introduced by George Dantzig in the 1940s and quickly became a powerful 
tool for solving complex optimization problems. In the latter half of the twentieth 
century, optimization began to be applied to a wide range of fields beyond math- 
ematics and engineering. Operations research, which uses mathematical models to 
optimize complex systems, became a popular field in business and management. Opti- 
mization techniques were also applied to fields such as finance, transportation, and 
telecommunications. These models are typically static and deterministic, meaning 
they do not take into account the dynamic nature of many real-world systems. Over 
time, researchers have developed increasingly sophisticated optimization algorithms 
that can adapt to changing conditions and learn from experience. 

Adaptive dynamic programming (ADP) is a powerful optimization technique for 
improving dynamic systems. Despite being a relatively new area of optimization, 
it has received broad usage in various industries. ADP is rooted in the principle 
of dynamic programming, which involves breaking down a complex optimization 
problem into smaller subproblems. These subproblems are solved recursively using 
backward induction. ADP learns from feedback and adjusts its behavior accordingly, 
making it useful for systems operating in uncertain environments. It has been applied 
toa wide range of problems, including electrical power systems, robotics, self-driving 
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cars, and trading strategies. This book focuses on the practical application of ADP 
in chemotherapy drug delivery, taking into account clinical variables and real time 
data. ADP’s ability to adapt to changing conditions and make optimal decisions in 
complex and uncertain situations makes it a valuable tool in addressing pressing 
challenges in healthcare and other fields. As optimization technology evolves, we 
can expect to see even more sophisticated and powerful solutions emerge. 


Shenyang, China Jiayue Sun 
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Chapter 1 A) 
Introduction E 


The stability analysis of dynamical systems, which are ubiquitous in nature, has 
long been a hot topic of research and several approaches have been proposed. How- 
ever, control scientists often demand optimality in addition to the stability of the 
control system. In the 1950s and 1960s, motivated by the development of space 
technology and the practical use of digital computers, the theory of optimization of 
dynamical systems developed rapidly, forming an important branch of the discipline: 
optimal control. It is increasingly used in many fields, such as space technology, sys- 
tems engineering, economic management and decision-making, population control, 
and optimization of multi-stage process equipment. In 1957, Bellman proposed an 
effective tool for solving optimal control problems: the dynamic programming (DP) 
method [1]. At the heart of this approach is Bellman’s optimality principle, which 
states that the optimal policy for a multilevel decision process has the property that, 
regardless of the initial state and initial decision, the remaining decisions must also 
be optimal for the state formed by the initial decision. This principle can be reduced 
to a basic recursive formula for solving multilevel decision problems by starting at 
the end and working backward to the beginning. It applies to a wide range of discrete, 
continuous, linear, nonlinear, deterministic, and stochastic systems. 

ADP is a new approach to approximate optimality in the field of optimal control, 
and it is a current research topic in the international optimization field. The ADP 
method uses the function approximation structure to approximate the solution of the 
Hamilton-Jacobi-Bellman (HJB) equation and uses offline iteration or online update 
to obtain the approximate optimal control strategy of the system, which can effec- 
tively solve the optimal control problem of nonlinear systems [2-11]. Bertsekas et 
al. summarized neuronal dynamic programming in the literature [12, 13], describ- 
ing in detail dynamic programming, the structure of neural networks, and training 
algorithms. Meanwhile, several effective methods have been proposed for apply- 
ing neuronal dynamic programming. Si et al. summarized the development of ADP 
methods in cross-cutting disciplines and discussed the connection of DP and ADP 
methods with artificial intelligence, approximation theory, control theory, operations 
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research, and statistics [14]. In [15], Powell showed how to use ADP methods to 
solve deterministic or stochastic optimization problems, and pointed out the direc- 
tion of ADP methods. In [16], Balakrishnan et al. concluded previous approaches to 
the design of feedback controllers for dynamic systems using the ADP method from 
both model and model-free cases. In [17], the ADP method was described from the 
perspective of requiring initial stability and not requiring initial stability. 

The ADP method has a unique algorithm and structure compared to other existing 
optimal control methods. It overcomes the drawback that classical variational theory 
cannot handle optimal control problems with closed-set constraints on the control 
variables. Like the maximum value principle, the ADP method is not only suitable 
for optimal control problems with open-set constraints, but also for optimal control 
problems with closed-set constraints. While the extreme value principle can only 
provide the necessary conditions for optimal control problems, the DP method gives 
sufficient conditions. However, the direct application of the DP method is difficult 
due to the difficulty of solving the problem of “dimensional disaster” by the HJB 
equation in the DP method. Hence the ADP method, as an approximate solution to the 
DP method, overcomes the limitations of the DP method. It is more suitable for appli- 
cations in systems with strong coupling, strong nonlinearity and high complexity. 
For example, the literature [18] presented a constrained adaptive dynamic program- 
ming (CADP) algorithm that could be used to solve general nonlinear non-affine 
optimal control problems with known dynamics. Unlike previous ADP algorithms, 
it was able to handle problems with state constraints directly by proposing a con- 
strained generalized policy iteration framework that transforms the traditional policy 
improvement process into a constrained policy optimization problem with state con- 
straints. To solve the problem of robust tracking control, the literature [19] designed 
an online adaptive learning structure to build a robust tracking controller for nonlinear 
uncertain systems. The literature [20] proposed an iterative method of bias policy for 
solving data-driven optimal control problems for unknown continuous linear systems 
by adding a bias parameter that could further relax the conditions of the initial admis- 
sible controller. The literature [21] considered the first attempt at ADP control for 
a nonlinear It6-type stochastic system, which transformed a complex optimal track- 
ing control problem into a stable control optimization problem by reconstructing a 
new stochastic augmented system. The use of a critical neural network in iterative 
learning subsequently simplifies the structure of the behavioral criterion and reduces 
the computational load. The ADP approach is still very widely used for a number of 
common practical systems. The literature [22] developed an event-triggered adaptive 
dynamic planning method to design formation controllers, and solved the problem 
of distributed formation control for multi-rotor UAS. For wind/light energy hybrid 
systems, literature [23] presented an adaptive dynamic programming method based 
on Bellman’s principle, which enables accurate current sharing and voltage regula- 
tion. Based on this approach, it is possible to obtain the optimal control variables for 
each energy body objective. 

Optimal control of nonlinear systems has been one of the hot spots and difficulties 
in the field of control research. As a novel technology to solve the optimal control 
problem, ADP method integrates the theories of neural network, adaptive evaluation 
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design, augmented learning and classical dynamic programming, to overcome the 
problem of “dimensional disaster”, which also enables the acquisition of an approx- 
imate optimal closed-loop feedback control law. As a consequence, delving deeper 
into the theory of ADP and its algorithms for solving optimal control of nonlin- 
ear systems holds immense theoretical significance and practical application value. 
Although the researches on the ADP method are still in its early stages, this book 
aims to equip readers with a foundational understanding of the method and empower 
them to apply it to diverse optimization problems in fields such as medicine, science, 
and engineering. 


1.1 Optimal Control Formulation 


There are several schemes of dynamic programming [1, 13, 24]. One can con- 
sider discrete-time systems or continuous-time systems, linear systems or nonlin- 
ear systems, time-invariant systems or time-varying systems, deterministic systems 
or stochastic systems, etc. Discrete-time (deterministic) nonlinear (time-invariant) 
dynamical systems will be discussed first. Time-invariant nonlinear systems cover 
most of the application areas and discrete time is the basic consideration for digital 
implementation. 


1.1.1 ADP for Discrete-Time Systems 


Consider the following discrete-time nonlinear systems: 
Xe = F(x, ux), k = 1,2,..., (1.1) 


where x, € R” is the state vector and ug € R” is the control input vector. The corre- 
sponding cost function (performance index function) of the system takes the form of 


CO 
J (xk, Ta) = X VU Qe, ui), (1.2) 
isk 


where Ux = (uk, Uk+1, ---) is the control sequence starting at time k. U (x;, u;) is the 
utility function. ~ is the discount factor, meeting 0 < y < 1. Note that the function 
J is dependent on the initial time k and the initial state x. Generally, it is desired 
to determine wo = (uo, U4, ...) so that J(xo, Uo) is optimized (i.e., maximized or 
minimized). We will use m% = (uj, uj, ...) and J*(xo) to denote the optimal control 
sequence and the optimal cost function, respectively.The objective of dynamic pro- 
gramming problem in this book is to determine a control sequence ug, k = 0, 1, ..., 
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so that the function J (i.e., the cost) in (1.2) is minimized. The optimal cost function 
is defined as 
J* (xo) = inf J (xo, Wo) = J (xo, WG), (1,3) 
uo 


which is dependent upon the initial state xo. 

The control action may be determined as a function of the state. In this case, 
we write ug = u(x), Vk. Such a relationship, or mapping u : R” —> R”, is called 
feedback control, or control policy, or policy. It is also called control law. For a given 
control policy p, the cost function in (1.2) is rewritten as 


FG) = >of UG HG (1.4) 


i=k 


which is the cost function for system (1.1) starting at xk when the policy ug = (xz) 
is applied. The optimal cost for system (1.1) starting at xo is determined as 


J* (xo) = inf J” (x0) = J” (x0), (1.5) 
H 


where u* denotes the optimal policy. 

Dynamic programming is based on Bellman’s principle of optimality [1, 13, 24]: 
An optimal (control) policy has the property that no matter what previous decisions 
have been, the remaining decisions must constitute an optimal policy with regard to 
the state resulting from those previous decisions. 

According to Bellman, the minimum cost of any state starting at time k consists 
of two parts, one of which is the minimum cost at time k and the other part is the 
cumulative sum of the infinite minimum cost starting from time k + 1. In terms of 
equations, this means that 


J? (XK) = min{U (xp, ux) + VF" ee} 


: (1.6) 
= min{U (xx, uk) + yJ*(F (xg, ux))}- 


This is known as the Bellman optimality equation, or the discrete-time Hamilton- 
Jacobi-Bellman (HJB) equation. One then has the optimal policy, i.e., the optimal 
control už at time k is the ux that achieves this minimum as 


u* = arg min J{U (xe, ux) + J" Gn) (1.7) 


Since one must know the optimal policy at time k + 1 to (1.6) use to determine the 
optimal policy at time k, Bellman’s principle yields a backwards-in-time procedure 
for solving the optimal control problem. It is the basis for dynamic programming 
algorithms in extensive use in control system theory, operations research, and else- 
where. 
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1.1.2 ADP for Continuous-Time Systems 


For continuous-time systems, the cost function J is also the key to dynamic pro- 

gramming. By minimizing J, one gets the optimal cost function J*, which is often 

a Lyapunov function of the system. As a consequence of the Bellman’s principle 

of optimality, J*satisfies the Hamilton-Jacobi-Bellman (HJB) equation. But usually, 

one cannot get the analytical solution of the HJB equation. Even to find an accurate 

numerical solution is very difficult due to the so-called curse of dimensionality. 
Consider the continuous-time nonlinear dynamical system 


X(t) = F(x), u(t)), t = to, (1.8) 


where x € R” is the state vector and u € R” is the control input vector. The corre- 
sponding cost function of the system can be defined as 


J (xo, u) = J U(x(T),u(T))dT, (1.9) 


to 


with utility function U (x,u) > 0, where x(tọ) = x9. The Bellman’s principle of 
optimality can also be applied to continuous-time systems. In this case, the optimal 
cost 

J*(x(t)) = eG), u(t))},t > to, (1.10) 


satisfies the HJB equation 


* *x\ T 
— u = min { U (x,u) + a F(x,u)}. (1.11) 
Ot u(t) Ox 


The HJB equation in (1.11) can be derived from the Bellman’s principle of opti- 
mality [24]. Meanwhile, the optimal control u* (t) will be the one that minimizes the 
cost function, 

u* (t) = arg min{J Œ (2), u(t))}, t > to. (1.12) 


In 1994, Saridis and Wang [25] studied the nonlinear stochastic systems 
described by 


dx = f(x, t)dt + g(x, Nu dt +h(x, dw, to < t < T (1.13) 


with the cost function 


T 
J (xo, u) = E {f (Q(x, t) + uu) dt + ġ(x(T), T) : x (to) = vo} 
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where x € R”, u € R”, and w €e R¥ are the state vector, the control vector, and a 
separable Wiener process; f, g and h are measurable system functions; and Q and 
ġ are nonnegative functions. A value function V is defined as 


T 
Va,pn= | f (Ox, t)+ uu) dt + o(x(T), T) : x (to) = xo] tel, 


where J £ [t), T]. The HJB equation is modified to become the following equation 


OU + BV + OHUTU = WV (1.14) 


where -%, is the infinitesimal generator of the stochastic process specified by (1.13) 
and is defined by 


T 
ZV a5 fac th" (x, n (==) | 


T 
+ (ZED) Jentgen 
x 

Depending on whether VV < 0 or VV > 0, an upper bound V or a lower bound 
V of the optimal cost J* are found by solving equation (1.14) such that V < J* < V. 
Using V (or V) as an approximation to J*, one can solve for a control law. This leads 
to the so-called suboptimal control. It was proved that such controls are stable for the 
infinite-time stochastic regulator optimal control problem, where the cost function 
is defined as 


1 T 
J (xo, 4) = lim 7 f (0,1) +0 u) ar x) = x0] 


The benefit of the suboptimal control is that the bound V of the optimal cost J* can 
be approximated by an iterative process. Beginning from certain chosen functions 
uo and Vo, let 


ete (1.15) 


1 T 
uj (x, t) = ~ 59 (x, t) 


Then, by repeatedly applying (1.14) and (1.15), one will get a sequence of func- 
tions V;. This sequence {V;} will converge to the bound V (or V ) of the cost function 
J*. Consequently, u; will approximate the optimal control when i tends to oo. It is 
important to note that the sequences {V;} and {u;} are obtainable by computation and 
they approximate the optimal cost and the optimal control law, respectively. 
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Some further theoretical results for ADP have been obtained in [2]. These works 
investigated the stability and optimality for some special cases of ADP. In [2], Murray 
et al. studied the (deterministic) continuous-time affine nonlinear systems 


x= f(x) + g(x)u, x (to) = xo (1.16) 


with the cost function 


iow) = | U(x, u)dt (1.17) 


to 


where U (x, u) = Q(x) + uT R(x)u, Q(x) > Oforx Æ Oand Q (0) = 0,and R(x) > 0 
for all x. Similar to [25], an iterative procedure is proposed to find the control law 
as follows. For the plant (1.16) and the cost function (1.17), the HJB equation leads 
to the following optimal control law 


dJ 
u*(x) = -iR 1&)g" (x J zol, (1.18) 


Applying (1.17) and (1.18) repeatedly, one will get sequences of estimations of the 
optimal cost function J* and the optimal control u*. Starting from an initial stabilizing 
control vo(x), fori = 0, 1, ..., the approximation is given by the following iterations 
between value functions 


tug / U (x(r), vj(r)) dr 


and control laws 


1 dV; 
vite) = -5R g ro |T] 


The following results were shown in [2]. 

(1) The sequence of functions {V;} obtained above converges to the optimal cost 
function J*. 

(2) Each of the control laws v;+; obtained above stabilizes the plant (1.16), for 
all į = 0, 1,... 

(3) Each of the value functions V;,;(x) is a Lyapunov function of the plant, for 
alli =0,1,... 

Abu-Khalaf and Lewis [26] also studied the system (1.16) with the following 
value function 


Vax) = f U(x(T), u(T))dT = f (x7 (T) Qx(T)+ u? (T)Rx(T)) dr 


where Q and R are positive-definite matrices. The successive approximation to the 
HJB equation starts with an initial stabilizing control law vo(x). For i = 0, 1,..., 
the approximation is given by the following iterations between policy evaluation 
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0 =x" Ox + v/ Rua) + VV (x) (E) + guia) 


and policy improvement 
wat) = —L Rg? GOV; 
vi) = —p Rg WVV) 


where VV; (x) = OV;(x)/Ox. In [26], the above iterative approach was applied to 
systems (1.16) with saturating actuators through a modified utility function, with 
convergence and optimality proofs showing that V; > J* and v; > u*, asi > oo. 
For continuous-time optimal control problems, attempts have been going on for a 
long time in the quest for successive solutions to the HJB equation. Published works 
can date back to as early as 1967 by Leake and Liu [26]. The brief overview presented 
here only serves as a beginning of many more recent results [26-28]. 


1.2 Publication Outline 


The general layout of the presentation of this monograph is given as follows. Adap- 
tive dynamic programming is used to design drug dosage regulation mechanisms to 
provide adaptive viral treatment strategies for input-limited organisms, and to extend 
this to tumour cells, immune cells and interplay and regulation schemes among the 
immune system. The main contents of this monograph are shown as follows: 


Chapter 1 introduces the research background, development and current status 
of ADP both domestically and internationally, as well as the idea and design 
framework of the underlying ADP, including discrete-time and continuous-time 
systems. 

Chapter 2 investigates optimal regulation scheme between tumor and immune 
cells based on ADP approach. The therapeutic goal is to inhibit the growth of tumor 
cells to allowable injury degree, and maximize the number of immune cells in the 
meantime. The reliable controller is derived through the ADP approach to make 
the number of cells achieve the specific ideal states. Firstly, the main objective is 
to weaken the negative effect caused by chemotherapy and immunotherapy, which 
means that minimal dose of chemotherapeutic and immunotherapeutic drugs can 
be operational in the treatment process. Secondly, according to nonlinear dynam- 
ical mathematical model of tumor cells, chemotherapy and immunotherapeutic 
drugs can act as powerful regulatory measures, which is a closed-loop control 
behavior. Finally, states of the system and critic weight errors are proved to be 
ultimately uniformly bounded with the appropriate optimization control strategy 
and the simulation results are shown to demonstrate effectiveness of the cyber- 
netics methodology. 

Chapter 3 investigates the optimal control strategy problem for nonzero-sum 
games of the immune system based on adaptive dynamic programming. Firstly, 
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the main objective is approximating a Nash equilibrium between the tumor cells 
and the immune cell population, which is governed through chemotherapy drugs 
and immunoagents guided by the mathematical growth model of the tumor cells. 
Secondly, a novel intelligent nonzero-sum games-based ADP is put forward to 
solve optimization control problem through reducing the growth rate of tumor cells 
and minimizing chemotherapy drugs and immunotherapy drugs. Meanwhile, con- 
vergence analysis and iterative ADP algorithm are specified to prove feasibility. 
Finally, simulation examples are listed to account for availability and effectiveness 
of the research methodology. 

Chapter 4 devotes to evolutionary dynamics optimal control oriented tumor 
immune differential game system. Firstly, the mathematical model covering 
immune cells and tumor cells considering the effects of chemotherapy drugs 
and immune agents. Secondly, the bounded optimal control problem covering 
is transformed into solving HJB equation considering the actual constraints and 
infinite-horizon performance index based on minimize the amount of medication 
administered. Finally, approximate optimal control strategy is acquired through 
iteration dual heuristic dynamic programming algorithm avoiding dimensional 
disaster effectively and providing optimal treatment scheme for clinical applica- 
tions. 

Chapter 5 mainly proposes an evolutionary algorithm and its first application 
to develop therapeutic strategies for Ecological Evolutionary Dynamics Systems 
(EEDS), obtaining the balance between tumor cells and immune cells by rationally 
arranging chemotherapeutic drugs and immune drugs. Firstly, an EEDS nonlinear 
kinetic model is constructed to describe the relationship between tumor cells, 
immune cells, dose, and drug concentration. Secondly, the N-Level Hierarchy 
Optimization (NLHO) algorithm is designed and compared with 5 algorithms 
on 20 benchmark functions, which proves the feasibility and effectiveness of 
NLHO. Finally, we apply NLHO into EEDS to give a dynamic adaptive optimal 
control policy, and develop therapeutic strategies to reduce tumor cells, while 
minimizing the harm of chemotherapy drugs and immune drugs to the human 
body. The experimental results prove the validity of the research method. 

Chapter 6 investigates the optimal control strategy for organism by using ADP 
method under the architecture of Firstly, a tumor model is established to formulate 
the interaction relationships among normal cells, tumor cells, endothelial cells 
and the concentrations of drugs. Then, the ADP-based method of single-critic 
network architecture is proposed to approximate the coupled HJEs under the 
medicine dosage regulation mechanism (MDRM). According to game theory, the 
approximate MDRM.-based optimal strategy can be derived, which is of great 
practical significance. Owing to the proposed mechanism, the dosages of the 
chemotherapy and anti-angiogenic drugs can be regulated timely and necessarily. 
Furthermore, the stability of the closed-loop system with the obtained strategy is 
analyzed via Lyapunov theory. Finally, a simulation experiment is conducted to 
verify the effectiveness of the proposed method. 

Chapter 7 investigates the constrained adaptive control strategy based on virother- 
apy for organism using the MDRM. Firstly, the tumor-virus-immune interaction 
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dynamics is established to model the relations among the tumor cells (TCs), virus 
particles and the immune response. ADP method is extended to approximately 
obtain the optimal strategy for the interaction system to reduce the populations 
of TCs. Due to the consideration of asymmetric control constraints, the non- 
quadratic functions are proposed to formulate the value function such that the 
corresponding Hamilton-Jacobi-Bellman equation (HJBE) is derived which can 
be deemed as the cornerstone of ADP algorithms. Then, the ADP method of 
single-critic network architecture which integrates MDRM is proposed to obtain 
the approximate solutions of HJBE and eventually derive the optimal strategy. 
The design of MDRM makes it possible for the dosage of the agentia containing 
oncolytic virus particles to be regulated timely and necessarily. Furthermore, the 
uniform ultimate boundedness of the system states and critic weight estimation 
errors are validated by Lyapunov stability analysis. Finally, simulation results are 
given to show the effectiveness of the derived therapeutic strategy. 
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Chapter 2 A) 
Neural Networks-Based Immune cia 
Optimization Regulation Using Adaptive 
Dynamic Programming 


2.1 Introduction 


In the fight against cancer, there had been no effective measures before chemotherapy 
and radiation appeared since there only exist tiny differences between cancer cells 
and normal cells. Doctors operate to remove solid tumors that have not yet spread, 
which can not guarantee cancer from recurring. When radiotherapy and chemother- 
apy have increased side effects, and targeted therapy is not flexible because of its 
strong pertinence, the scientific research direction began to turn to the human body 
system. Generally, tumor cells escape from the immune system, not because it fails 
for the immune system to recognize them or it is not activated, but cancer cells have 
evolved a way to block the activation of T cells by making a specific binding. Thus, the 
medical communities have struggled to find a lot of special means for cancer cells to 
intercept the activation of the T cells, freeing up the immune system. Compared with 
traditional treatments such as surgery, radiation and chemotherapy, immunotherapy 
has fewer side effects and better therapeutic effects. However, it is difficult to tackle 
the transient period of immune agents. Therefore, the hybrid therapy of chemother- 
apy and immunotherapy is a better choice. As [1], it is hardly sufficient to control 
tumor growth through neither chemotherapy nor immunotherapy alone, but tumor 
cells can be eradicated by adopting the combination therapies which is known as 
biochemotherapy described in [2]. 

With extensive development of nonlinear dynamic [3, 4], its engineering applica- 
tion scenarios enjoy increasing diversification such as competitive Nash equilibrium 
problems, especially in the biomedical field. And not coincidentally, game theory 
has been introduced into the interaction model of tumor cells and immune cells. Both 
of the chemotherapy and immunotherapy aim at reducing the number of tumor cells. 
Based on this fact, the collaborative game is formed and one can design adaptive 
therapy from the view of game theory. Multiple biological interactions constitute 
complex nonlinear growth process of tumor cells, however, regarding major influ- 
ence factors of tumor cell populations as research object is the focus. Hunting cells 
refer to the immune cells participating in removing foreign agents and strengthening 
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the immune response. temperatures have suggested that cell-mediated anti-tumor 
immunity contributed to increasing the population of hunting tumor cells to main- 
tain a specific proportion between the resting and the hunting predator cells as 40% in 
literature [5], which was beneficial for maintenance of the tumor dormant state. The 
immune regulations vary from individual to individual, but immunotherapy-based 
optimal regulation plays the role of reducing tumor cells without considering certain 
circumstances in case of special invocation. Enhanced tumor antigen presentation 
could effectively stimulate dendritic cells and increase the immunotherapy-based 
curative effect in [6]. The known “predator-prey” between immune cells and tumor 
cells leads to cyclic growth and reduction, which can be continue indefinitely or 
reach an equilibrium saddle point determined by system parameters. Literature [7] 
investigated nonlinear dynamical model which provided guiding significance for 
introducing that into cybernetics. As known, system identification or optimal control 
is of great practical value. As a powerful and effective optimization algorithm, the 
ADP method can solve the nonlinear optimal control problems well, realizing the 
most appropriate therapeutic strategy. 

Of course, the immune system has the responsibility for restraining tumor growth, 
but it is hardly to fight out the tumor cells alone. Firstly, ego characteristic of tumor 
cells compared to normal cells within the body leads to no exclusion and tolerance 
to tumor cells of the immune system. Secondly, there is no strong defense mecha- 
nism itself in fighting with the cancer cells which means the failure of the immune 
response. Finally, Immune function was observed to be protective through interven- 
tion with organic binding agents of CD4 and CD8 cells. Chemotherapy can not only 
rapidly kill differentiated tumor cells, but also destroy regular cells. This side effect 
caused by chemotherapy can be lessened through introducing the immunotherapy. 
Thus the combined therapy of chemotherapy and immunotherapy is more reason- 
able. Immunotherapies can strengthen the immune system through extra stimula- 
tion, on the other hand, improve the ability to recognize foreign entity. Therefore, 
decelerating the growth rate of tumor cells with minimized dose of chemotherapy 
and immunotherapeutic drugs is the control objective. Furthermore, optimal control 
strategy is obtained through ADP method, giving the optimal levels of each treatment 
regimen through nonzero-sum differential games strategy developed in [8]. 

Prescribed performance tracking control has been creatively developed as [9], 
however, there is seldom any literatures focusing on this scope considers mutual rela- 
tionship among tumor cells, immune cells, chemotherapy and immunotherapy drugs, 
let alone setting the performance as eventually acquired of optimal therapeutic effect 
associated with coupling behaviors mentioned above. Retrospect to literatures as [10], 
the chapter transformed it into multi-player nonzero-sum games problems whose 
optimal control was obtained by complex decoupling in dealing with Hamilton- 
Jacobi equation as [11]. Subsequently, online adaptive and off-policy learning algo- 
rithms were respectively developed in [12—14]. Of course, the constrained-input was 
taken into consideration, when it comes to practical applications in [15], even more 
intensive work on uncertain constraints were in contemplation considered as [16]. As 
[17], the control policies of the distributed subsystems acted as players, noticeably, the 
chapter was formulated as a two-players nonzero-sum game including chemotherapy 
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and immunotherapy. [18] first introduced an updating strategy based on intertask 
relationships. Synchronously, reciprocal action between the tumor cells and immune 
cells which could be analogous to interactions between systems in [19, 20]. 

The unknown nonlinear dynamic is usually implemented by fuzzy control as 
[21, 22] and neural networks in [18, 23], where the actor network and critic network 
are adopted for updating control policy at an appropriate time through policy iteration 
technique as [24-26]. The convergence of model-based policy iteration algorithm 
is equivalent to that of data-based learning as [27]. Similarly, states of the system 
and critic error are required to be ultimately uniformly bounded during the process 
of value iteration, which is guaranteed through event-triggered formation control 
scheme firstly proposed for all signals of the closed-loop system in literature [28]. 
According to the iterative value algorithm, the optimum can be obtained through 
learning continuously [29, 30]. However there is little research on the two-players 
nonzero-sum game considering tumor cells and immune cells using the proposed 
value iteration learning. 


2.2 Preliminaries 


As is known, there exist interaction relationships among the anticancer agent cells, 
lymphocytes and macrophages that constitute the basic immune system microen- 
vironment, which can be presented as follows. Firstly, T-lymphocytes and cyto- 
toxic macrophages/natural killer cells can effectively damage tumor cells. Secondly, 
destroyed behaviour of macrophages can also active T-lymphocytes for launching 
another attack. Meanwhile, the population of T-lymphocytes can be fed through rest- 
ing cells. Finally, the model is guided by degradation of resting cells and activation 
of immune cells by natural growth rate. This section gives the nonlinear growth 
equation which can represent the whole immune response. 


Niotal = UNE) (2.1) 
v + Nr(t) 


where Ny (t), Nr (t) denote the number of hunting cells and tumor cells at time t, 
respectively. v and v are positive constants. The changes in quantity caused by the 
inactivation of the immune cells and the apoptosis of tumor cells are presented as: 


aro = 0, Ny ONG) 
ao ONG (2.2) 


where gı denotes the loss rate of Np (t) caused by Ny (t) and o> represents the loss 
rate of Ny (t) caused by N7 (t). The situations above reflect the competition between 
tumor cells and the host cells. Then we construct the dynamic equations as follows 
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Table 2.1 Detailed descriptions of system parameters 


Parameter | Estimated value 


Intrinsic growth rate of Np (t) ignoring Ncp(t) 


01 Reciprocal carrying capacity of N7 (t) irrespective of Np (t) and Ncp(t) 
h Constant influx rate of Ny (t) 

(eal Rate of loss of Nr (t) for Ny (t) 

ôi Response coefficient to Ncp (t) for Nr (t) 

D Per capita decay rate of Ny (t) without regard to Np (t), Ncp(t) and N; p(t) 
02 Rate of loss of Ny (t) for Nr (t) 

02 Response coefficient to Nc p(t) for Ny (t) 

v Maximum recruitment rate of Ny (t) by ligand-transduced Np (t) 

ç Maximum recruitment rate of Ny (t) by N7 p(t) 

v Steepness coefficient of Ny (t) by Nr (t) 

V Steepness coefficient of Ny (t) by Nr p(t) 

pı Decay rate of Ncp (t) 

p2 Decay rate of N7 p(t) 


Nr(t) =a NTO — a Nr(t)) — Nr Na (t) 


— ôi Nco (t) Nr (t) 
: Ng (ONZ Ng (ON 
Nu(t) = ae rn - ee - oyNr(t)Nu(t) 
T 


— DNy (t) — &Nep(t)Na(t) 
(2.3) 


where D represents the death rate of cells without considering any tumor cells. 
la (a = 1,2) and oa denote the per capita growth rates and reciprocal carrying 
capacities. The descriptions of the other associated parameters are given in Table 2.1. 

Consider the given chemotherapy and immunotherapy drugs as u(t) and v(t) at 
time ft, which is regarded as multiple dose administration compared with influence of 
recombinant human interleukin-11 for injection or recombinant human granulocyte 
colony-stimulating factor injection. Assume that targeted therapy cannot be achieved 
through only chemotherapeutic drugs. Then we can obtain that 


Fresponse(t) = Sa(l = eu) (2.4) 


where sa is the different response coefficients for distinguishing the change rate of 
different cells. The mathematical model considering injected drugs is presented as 
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Nepv(t) =u(t) — pı Ncp(t) 
Nrp(t) = v(t) — p2N1p(t) 

Nr(t) =uNr(t)( — ai NT) — oi NT (t)Nu(t) 
— Nen (Nr (t) — s21 — e™™ O) 
VNHONF©A — SNa(t)Nrv(t) 

v + N2(t) 0+ Nrp(t) 

— &Nep(t)Nu(t) — si — eo) (2.5) 


Nu(t) = orNr(t)Nu(t) — DNu(t) 


where Nc p(t) and N7p(t) are concentrations of chemotherapy and immunotherapy. 
v(t) and u(t) are the doses of chemotherapeutic drug and immunotherapeutic drug. 
Generally speaking, A is taken as 1 for the unknown role of cytokines. 


Remark 2.1 The model (2.5) describes the relations among the hunting cells, the 
tumor cells, the concentration of chemotherapy agentia, and the concentration of 
immunotherapy agentia. From (2.5) we can find both of the hunting cells and the 
chemotherapy agentia can reduce the number of tumor cells, and the immunotherapy 
agentia can stimulate the growth of hunting cells. On the other hand, the tumor cells 
can influence the number of hunting cells. Based on this complicated interactive 
relationship, we can obtain the optimal object through ADP, that is, minimization of 
tumor cells while ensuring the number of normal cells at certain time f. 


Before proceeding, let X = [Nr, Nu, Ncp, Nıp]”, then the model (2.5) can be 
simplified as 


X(t) = fX + guU) + K(X) v1) (2.6) 
where f(X) is the right-hand dynamics of (2.5) excluding the control u(t) and v(t). 


The matrixes g(X) = [0, 0, 1, 0]? and «(X) = [0, 0, 0, 1]”. 
For system (2.6), the performance index function of the e player can be given as 


CO 
J.(X0) =| (x72.x +u Rau + v Rav) dr (2.7) 
0 


where Qe is positive definite matrix, Ra and Re are symmetric positive matrixes. 
The corresponding cost functions are presented as: 


U(X, u, v) = f R.(X, u, v)dT (2.8) 


with the utility function 


R(X, u, v) =X79.X +u Rau +u Rov. (2.9) 
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Definition 2.2 For two-player NZS game of system (2.6), the Nash equilibrium 
solution is said to be obtained with the control pair (u*, v*) which satisfied that, 


Vu", v*) < Ve(u, v*) 
Yu", v*) < Ve(u*, v) 


for any admissible control policies u and v. 


The Hamilton functions can be constructed as: 


H: (X, u, v) = XTOX +u Rau +u Rov 
+ VVE (F(X) + g(X)u(t) + KX) 


(2.10) 


(2.11) 


where VV, is the partial derivative of the cost function and € = 1, 2. According to 
the stationarity conditions at equilibrium points, the optimal control for two players 


are obtained 
1 
u* = -58n g OVO} 
1 
v= — 5 Rn K OVV 


with V* and V% being the solutions of coupled HJ equations as 


1 
XTQ X — ZV GOR g OVV] + VVT f(X) 


1 

+ IV R(X)Ryy RaRa K (X)VV; 
1 

— z VVE KO Rag K (XV =0, 


and 


1 
XTQX — JV ROO Ry K OVV + VVT f(X) 


1 
+ VVE IOR Ra Rig! OVV 


1 m 
— 5V JORD g" OVV =0. 


(2.12) 


(2.13) 


(2.14) 


Lemma 2.3 Fornonlinear system (2.6), suppose that V* and V% satisfy the equations 
(2.13) and (2.14). Then under the optimal control (2.12), the system is asymptotically 


stable. 


Proof The proof is omitted since it is similar to that in [31, 32]. 
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By solving the coupled HJ equations (2.13) and (2.14), one can obtain the optimal 
control as (2.12), which means the Nash equilibrium for the two-player NZS game 
system is attained. Nevertheless, due to the existence of nonlinear terms and coupled 
terms, these partial differential equations are uneasy to solve. Since ADP is a powerful 


approximate learning method, the approximate solutions of (2.13) and (2.14) can be 
acquired. 


2.3 Design of Adaptive Controller 


In order to find the optimal control strategy, a critic network is constructed based on 
neural network firstly. And then optimal value function can be shown as: 


OF = (TEX) +0, € = 1,2, (2.15) 
where ¢* € R”, é € R” ando, € R are the ideal weight vector, activation function 


and approximation error of the neural network. As it’s scarcely possible to get the 
weight ¢*, we give the approximate version 


OF = (C7 E(X). (2.16) 
Based on (2.12) and (2.15), we obtain the optimal control as 
ut = ERTI (XM(VEX))" GE + Vou) 
= Tai g 1 1 01 
1 
v= -5 Ra XNMVOEX)"G + Voz) (2.17) 
Then we further get the approximate control policies as 
a lp- T TA 
u = 5 Ri g (X)\(V&(X) & 


1 A 
d= — 5 Rag Rh OVEN O (2.18) 


Remark 2.4 For the unknowable nature of ideal weights, the NNs are used to 
approximate the system dynamics and approximate version as (2.16), aming at min- 
imizing the current estimate of the value functions in (2.15) by selecting policies 
(2.18) can be obtained with available closed-form expressions. 


According to (2.18), the closed-loop system can be rewritten as 


X(t) = f(X) + g(X)û + K(X). (2.19) 
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Furthermore, we can attain the approximate Hamilton as 


H,.(X, û, ô) = X7’Q.X +i Rei 


+ 87 Rod + (T VE(X) X(t) 
= e(t). (2.20) 


To approach the optimal strategy and minimize e(t), the goal of adaptive learn- 
ing is set to be 6 = & + & = 1/2e7 + 1/2e3. Then applying the gradient descent 
method, we obtain the learning law of critic for player e 


= 1 O6(t) _ 1 8 (t) _ AG) 
eT +I oi “Oat aC OFA. +1 
(2.21) 


where 6, = VE-(X) X(t), and oe is the positive learning law. Let & =(= A then 
we have 


> AO) 6.6! & 


Ce = Oc (576 + 1)? Oc (575 4. 1? = 0.5, 0¢(t) m 0-5-0! Ce, (2.22) 


where 6, = ĝe/ (ôT 5. + 1)’, Oc = 6-/(67 5. +1) and o,(t) = —Vo! (X)(f (X) + 
g(X)u + K(X)b) is the approximate residual error when employing critic neural 
network [33]. 

Before presenting the main results of this chapter, two regular assumptions are 
necessary [34-36]. 


Assumption 2.1 For € = 1, 2, the signal 6, is persistently excited such that the fol- 
lowing inequality is satisfied 


t+T 
oe ees f 6.6: de, (2.23) 
t 


where ve denotes the neuro number of the eth critic network. 


Assumption 2.2 For e = 1, 2, there exist positive constants €¢nax, Oemax ANd Cemax 
such that the following inequalities hold, that is, || VE&_(X)|| < Emax, || Voll < Oemax 
and [oe | < Oemax: 


Applying the Lyapunov method, the stability in the sense of UUB is demonstrated 
to be guaranteed by the following theorem. 


Theorem 2.5 For system (2.6),when the weight updating laws of critic networks are 
given by (2.21), then the UUB properties of the weight estimation error Çe can be 
guaranteed by the obtained control policies (2.18). 
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Proof Select the Lyapunov function as 


oo ae 
base GG +a GG. (2.24) 


Taking the time derivative of (2.24), then we obtain 
=p Q+ 
=Q] Gor) — 61671) + GF (6202(t) — 6363 Q2) (2.25) 
According to Young’s inequality, we have 
ZT ZTS lorz srz 1 , 
i ĝia lt) < Ci ialt) < A 616, Ci + aimas (2.26) 
Similarly, 
ZT lorz srz 1 , 
Q 6502(t) < z 5265 6 + z7 2max" (2.27) 


Substituting (2.26) and (2.27) into (2.25), we get 


P SA | ee 1 
L < 5010101 G — 52 0202 + T +02,.,). (2.28) 
From (2.28) we can conclude that -Ê < 0 when one of the following conditions 
holds 
2 2 
> Omax +o max 
I&I > o , i 
Amin (1 ôi ) 
or 


2 2 
Omax F Tmax 


od 7 Amin (0202 ) l 


(2.30) 


According to Lyapunov theory, it yields that the weight estimation errors for both 
critic networks are UUB. 


Remark 2.6 The weight matrices are usually updated through certain renewal equa- 
tions, and from (2.29) and (2.30), we can draw that the approximation weight error 
will asymptotically converge to zero as Ve > O0. 


Theorem 2.7 Consider the system (2.6). The weight updating laws for critic net- 
works are given by (2.21). Then the obtained policies (2.18) can force system states 
X to be UUB. 
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Proof In order to discuss the stability of closed-loop system, the derivative of V = 
VĚ + Vy is considered as 


V =(VU*)" (F(X) + g(X)û + K(X)ô) 
+ VVT) +g(X)ûù + K(X)D). (2.31) 


Recalling (2.13) and (2.14), we have 


1 
VVT f(X) = — XTQ X + VVE SORT g OVO! 


1 
-— V R(X) Roy RaRa K OVV 
1 *T -1 T * 
+ SVU KORY K OVV, (2.32) 
and 


1 
VEE FO) = AAS VE ROR KVS 


1 
— VVE IOR RaR g” OVV 
1 
+ sV GOR g OVV. (2.33) 
For e = 1, we can obtain vx as 


, 1 
Vt =- X7QX — AVM IORI g (XV! 


1 
-— IV RX) Roy RaRa K OVV 
— VUiT (g(X)(u* — fi) + K(X)(v* — ô)). (2.34) 


According to (2.15) and (2.16) we have 


1 
Vi =— XTQ X- ZV IORI g OVV 


= TVT KOR RnR KT ONS 
1 
+E + Vor)” (JORRIT 
x (CVET XD G + Vor) + KORY K” X) 
x (VET + Vor)). (2.35) 
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Due to Assumption 2.2 and Theorem 2.5, we obtain that 


f 1 
Vë <- XQ X- ZV IDR GOV} 


1 
- ZV UP R(X) Ryy RaRa K OVV +ô, 
(2.36) 
where the positive constant 6; denotes the bound of the term S((VE U(X) TCE + 
Vor)" (JORDI (XI (VET (XD) G + Vor) + RXR XV(VE OTG 


+Vor)). As 11, Riz and Rz are symmetric positive definite, we have 


LV UST ROO Rey RNR k7 (X)VUZ 
4 AVUFT GOR tg" COW} > 0. (2.37) 
Furthermore, we attain 
Of < -XTX + 0i < —Amin(Qu) XI? + 01. (2.38) 
Similarly, for e = 2, it yields that 


UF < -XTQ X +0 < —Amin(Q2)|| XI? + 02, (2.39) 


where the definition of @ is similar to that of 604. Then it can be concluded that 0 <0 
when the following inequality is satisfied 


|| X || > max 0: 02 2o (2.40) 
Xmin (Q1) , Amin (Q2) l f 


Thus with the proposed control policies (2.18), the system state N is UUB with 
the bound ©. This completes the proof. 


Remark 2.8 From Theorems 2.5 and 2.7, we can conclude that under the obtained 
control policies the states of the system X and the critic weight error ¢, are ultimately 
uniformly bounded. 


Remark 2.9 According to the clinical requirements, the specific value of the cost 
function is finalised. Transformation is implemented from the mathematical mecha- 
nism model to the solvable affine model. Subsequently, the chapter solve the optimal 
control problem that means minimum dose of medicine can realize the best thera- 
peutic effect. 
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2.4 Simulation and Numerical Experiments 


To verify the proposed method in the previous section, a simulation is given as 
followed. 


2.4.1 States Analysis on Tumor Cell Growth 


According to clinical medical statistics borrowed from the literature [37], the specific 
parameters of the dynamic models are presented as Table 2.2. 
According to (2.5) and Table 2.2, we construct the model (2.41) 


Nr (t) = 0.00431Nz(t)(1 — 1.02 x 10-°)Nr(t)) 
— 6.41 x 107 Nr (O) Ng (t) 
— 0.08Ncp(t)N7(t) — (1 — e™®) 
0.0125Ny(t)NZ(t) — 0.125Ny(t) Ny p(t) 
2.02 x 1074+ N2(t) | 2x 107+ Nyp(t) 
— 3.42 x 10-°N 7 (H) Np (t) — (1 — e™"®) 
— 0.204N y(t) — 3.42 x 10-°Nep(t) Nu (t) 
Nep(t) =u(t) — 0.1Ncp(t) 
Niv(t) = v(t) — Nipt) (2.41) 


Nu(t) =0.33 + 


The initial state of tumor cells N; (t) and immune cells N3 (t) ina patient and follow 
a certain chemotherapy and immunotherapy regimen. Correspondingly, N3(t) and 
Na(t) respectively denote the concentrations of chemotherapy and immunotherapy. 
And we can get the following curves on systems states tumor cells, immune cells, 
chemotherapy and immunotherapy drugs as shown in Fig. 2.1. Initial value is set as 
Xo = [20 1086]’. 


Table 2.2 Concentration variation on immune cells, tumor cells, chemotherapeutic drug and 
immunoagents 


Estimated value 
0.00431 
6.41 x 1071! 


Parameter 


0.08 


D 0.204 o2 3.42 x 1076 cell”! 
62 2x 1071! day! v 0.0125 day™! 
s 0.125 day™! 2.02 x 107 cell? 

0 2x 107 cell 0.1 day~! 


1 
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Fig. 2.1 The curves of system states 


It is obviously that the control policies can stabilize the nonlinear system and 
make the system states tend to zero which means that the closed-system is stable 
and the control method is effective. Retrospect the original problem that the key is 
to minimize cancer cells and reduce therapy toxicity as possible. 


2.4.2 Weight Analysis of Control Policies 


The weights ¢* of the control policies u (t) and v(t) can be estimated through the value 
function co = (Ê) T£ (X) in (2.16), and the performance index is shown as (2.6) with 

Q1 = Igyg, & = 591, Rii = Ry = 1, Rip = Ro = 2. The initialize weights are 
set as [—0.25, —0.25, —1, —0.25]”. The selected activation function is selected as 
[Ch vise Groote» Chao) where Cis = INFO), MON), NON), Ni ONCE), 
N3(t)] and Çi6>18 = [N2(t)N3(t), No(t)Na(t), NZ(t)] and ioio = [N3 (N4), 
Ni (O)] 

According to Fig. 2.2, we can conclude that the proposed optimal control demon- 
strated a shorter convergence time than that without taking optimal control, where 
the former needs only 10s, but the later may be 38s, which draws the superiority of 
the proposed method. 

In Fig. 2.3, we can obtain the less doses of the drugs is another advantage compared 
with that without taking optimal control. Taking comprehensive consideration of 
Figs. 2.2 and 2.3, we can draw a conclusion that the adopted algorithm can not only 
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Fig. 2.2 Optimal control policies u(t) 
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The control policy v(t) 
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Fig. 2.3 Optimal control policies v(t) 
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The system states 


Time (s) 


Fig. 2.4 The curves of system states 


decrease the convergence time but also reduce doses of chemotherapy drugs and 
immune agents, and patients will benefit from for the minimal toxicity and shorter 
response time. 

When the initialize state is set as [—0.5, —0.1, —1, —0.4]", and the other param- 
eters are unaltered, we give another set of figures as Figs. 2.4, 2.5 and 2.6. In Figs. 2.5 
and 2.6, there exist more obvious advantages for the proposed algorithms over that 
without taking optimal control in response time and control policies,and we can 
conclude that effectiveness of the control method does not vary in the different ini- 
tial weights. 


2.5 Conclusion 


This chapter has introduced adaptive dynamic programming into solving the optimal 
control policies of tumor cells growth model and realized objective of minimizing 
tumor cells with the minimum dose of chemotherapeutic and immunotherapeutic 
drugs. As is known, the negative effect caused by chemotherapy and immunotherapy 
must be reduced for the reasonable treatment plan extracted from the optimal con- 
trol behavior. Convergence properties have been proved to be guaranteed through 
Lyapunov theory. Meanwhile, states of the system and critic error have been demon- 
strated to be ultimately uniformly bounded. Simulations have been given to verify 
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Fig. 2.5 Optimal control policies u(t) 
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Fig. 2.6 Optimal control policies v(t) 
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rationality of the proposed method. In the future work, we will further investigate 
the medical frontier topics and propose adaptive therapeutic methods to solve these 
issues by employing ADP approach. 
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Chapter 3 A) 
Optimal Regulation Strategy get 
for Nonzero-Sum Games of the Immune 

System Using Adaptive Dynamic 

Programming 


3.1 Introduction 


As the rapid increase of tumor patients, immunotherapies integrated with multi- 
pronged approaches are being burgeoning for treatment of cancers with specific 
forms, especially for poorly immunogenic tumors as [1]. The original intention of 
immunotherapy is fighting cancer cells with their own lethality of immune cells. 
AIDS as atypical immunodeficiency syndrome caused by failure of immune response 
tends to be attributed to weakened immune levels, however, Natural killer cell pop- 
ulation determine whether shutdown of immune system, once the activate immune 
system can not be suspended from and produce cytokines [2], which is regarded as an 
overreaction of the immune system such as COVID-19. Thus, the Nash equilibrium 
between the tumor cells and the immune cell population needs to be solved through 
optimal regulation based on specific learning method, and optimal control scheme is 
firstly brought into this field with its unique superiority, what’s more, nonzero-sum 
games-based ADP enjoys meliority and practicability. 

Decision and estimation on unknown nonlinearity existed so extensively in fields 
of engineering practice, medical treatments and even the social sciences, such that 
literature [3] firstly proposed the evaluation of the designed S-Box with highly non- 
linearity on the basis of Chinese I-Ching philosophy. It is of great importance to 
make a suitable treatment decision in the field of health care where remains highly 
nonlinearity. To obtain an optimal mixed treatment strategy, the growth model of cell 
population levels was developed based on combination of immune and chemother- 
apy as literatures [4, 5]. When it comes to reaction of the immune system to tumor 
growth, a rather complicated nonlinear model of the immune system is requisite to 
simulate the overall aggressive combination treatment plan of immunotherapy and 
chemotherapy well. Thus, the process of solving the nonlinear function is hardly to be 
achieved unless the application of exceptionally optimized iterative algorithm such as 
backstepping techniques in [6], self-learning optimal regulation in [7], hierarchical 
lifelong learning as [8], broad learning adaptive neural control in [9] and adap- 
tive dynamic programming, which benefits from its adaptive capability and strong 
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autonomous iterative learning ability [10, 11]. Whether backstepping or adaptive 
dynamic programming both could guarantee the control objective would be achieved, 
and unknown nonlinear function matched the value of successive searching approx- 
imation through neural networks or fuzzy control as literatures [12-15]. 

Heo control enjoys excellent disturbance suppression while minimizing perfor- 
mance index and it is recognized as a typical two-player zero-sum problem, which 
can be equivalent to solving algebraic Riccati equations, and it is generally applied 
into linear dynamics systems, of course, systems with quadratic performance index 
could be actually solved such as literature [16]. Meanwhile, the familiar Hamilton- 
Jacobi-Isaacs is perceived as an effective medium in dealing with systems consider- 
ing inherent nonlinearity, such as unknown mechanical parameters in [17], which is 
difficult to achieve using conventional methods for absence of exact system parame- 
ters. The mainstream analysis of ADP is seeking optimal control strategy integrated 
with solution to Bellman functions without information of system dynamics, which 
has ascended to the core methodologies of optimization and artificial intelligence. 
When it comes to actual models, control constraint has been definitely considered as 
[18-20], thus the chapter mainly focuses on dynamic model of the immune system 
which limits the single injection of drugs to an intervention level, and the optimal 
control scheme is transformed into constrained control which needs to take a dis- 
counted factor into account, avoiding infinite time dimension effectively, which will 
lead to development of optimal constrained control policy. 

Model-free adaptive control was developed to obtain optimal control strategy 
without knowledge of exact system parameters as literatures [21-23], and multi- 
ple neural networks were constructed to achieve multi-objective approximation or 
optimization control process. Research with respect to multiple networks has been 
extended to multitudinous actor-critic constructions. A tremendous amount of prac- 
tical application scenarios need multiple controllers, each of which minimizes its 
individual performance function as nonzero-sum problem. As elaborated in nonzero- 
sum game theory, the control objective was minimizing the individual performance 
function and maintaining stability to yield a Nash equilibrium in [24]. As literature 
[25], saddle point of the Nash equilibrium was explored throughout the nonzero- 
sum games-based optimization iterative process using ADP, even if there was no 
feasible saddle point, optimum was realized through mixed optimal control scheme 
iteratively, and the latter is of universal significance for conditions that are uneasy to 
satisfy in practical applications, The local optimal problem exits extensively which 
was firstly effectively avoided through fault-tolerant adaptive multigradient recur- 
sive reinforcement learning as [9]. To seek the solution to Nash equilibrium, the 
simultaneous algebraic Riccati or Hamilton-Jacobi-Isaacs functions require solving 
for nonlinear systems, which leads to “curse of dimensionality” with huge amount of 
computation, especially for multitudinous actor-critic constructions suffering from 
higher computational burden by many multiples, such as a double-loop policy itera- 
tion in [26]. According to the reason described above, the chapter adopts compromise 
acceptable actor-critic neural networks with appropriate dimensions, effectively real- 
izing the transformation process from value iterative to cost function. 
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Value and policy iterations generally constitute the whole iterative methods, and 
begin with an semidefinite function or admissible control law accordingly. With 
applications of ADP to solve the optimal control strategy for both continuous 
[27, 28] or discrete-time systems [29, 30], however, traditional ADP can not sat- 
isfy the physical application in the immune system considering the mixed treatment 
strategy with chemotherapy drugs and immunotherapy, improving matters somewhat 
by nonzero-sum games-based ADP. There are seldom any literatures on nonzero-sum 
games-based ADP method for solving optimal regulation schemes of the immune 
system, let alone considering optimal constrained control, policy iterations, tumor 
regression and mixed control strategy of chemotherapy and immunotherapy, scilicet 
the cost function approaching covers minimization of the tumor cells, chemotherapy 
drugs and immunotherapy drugs, simultaneously. 


3.2 Establishment of Mathematical Model 


This part mainly introduces the mathematical growth model of tumor cells, 
which considers the influence of external factors such as chemotherapy drugs and 
immunotherapy on the tumor cells, mutual effect between two types of cells. In 
the following model, Tu(t) represents the amount of tumor cells, [m(t) denotes 
the number of immune cells, and Che(t), Im py(t) depicts the concentrations of 
chemotherapy drugs and immunotherapy drugs in the bloodstream, respectively. 


3.2.1 Growth Model of Tumor Cells 


Individually considering the natural growth law of tumor cells without the relation- 
ship with immune cells and any external effect on them, the growth law of tumor 
cells is subject to logical growth. 


Tu(t + 1) = Tut) + C,;Tu(d — CoTu(t)). (3.1) 
But when it comes to the interaction between immune cells and tumor cells, the 
direct killing of cells by chemotherapeutic drugs, and the growth model of tumor 


cells can be revised to: 


Tu(t+ 1) =Tu(t) + Ci Tut) — CoTu(t)) 
— Crm, ruT u(t)Im(t) — Cohe,ruT u(t) Che(t), (3.2) 


where the specifications of parameters are demonstrated as Table 3.1. 
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Table 3.1 Parameter specifications of the tumor cells 


Parameters Relationships with the tumor cells 

Cı Intrinsic growth rate irrespective of the immune cells and chemotherapy 
drugs 

C2 Maximum capacity for interaction between tumor cells ignoring effects of 


immune cells and chemotherapy drugs 


Cim,Tu Growth rate when the tumor cells are inactivated to attack from the 
immune cells 


CChe,Tu Stress response coefficient of the tumor cells to chemotherapy drugs 


3.2.2 Growth Model of Immune Cells 


Considering the natural growth law of immune cells simply, we assume that a fixed 
number of immune cells are produced in a unit of time and that these cells have an 
inevitable life cycle. 


Im(t + 1) = Im(t) + C3 — Crm,aTm(t) (3.3) 


The tumor cells in the body can stimulate the growth of immune cells, which 
shows a positive non-linear change by (3.4). 


a,Tu(t)*Im(t) 


Aim = 
Bı + Tu(t)? 


(3.4) 


In immunotherapy, the addition of immune agents can produce an immune 
response, which leads to the non-linear growth of immune cells. 


a2Tu(t)Impy(t) 


3.5 
b2 + Impy(t) aa 


Aim,, = 


Simultaneously, in the struggle between immune cells and tumor cells, immune 
cells themselves can also cause losses, 


ACruim = —CTutmTu(t)Im(t). (3.6) 


and in chemotherapy, chemotherapeutic drugs can also cause damage to immune 
cells. 


ACchein = —CChe,tmChe(t)Im(t). (3.7) 


Combined (3.3)—(3.7), and then (3.8) can be obtained. 
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Table 3.2 Parameter specifications of the immune cells 


Parameters Relationships with the immune cells 

C3 Constant inflow rate 

Cim.d Natural decay rate without any external action 

al Maximum rate of recruitment caused by the tumor cells 

By Steepness coefficient caused by the tumor cells 

a2 Maximum rate of tumor cells caused by immunotherapy drug 
Bo Steepness coefficient caused by immunotherapy drug 
CChe,Im Stress response coefficient to chemotherapy drug 

Cru.im Reaction rate of tumor cells to the immune cells 


Im(t +1) =Im(t) + C3 — Crmalm(t) + Aim + Aim,, 
FAG sg. ACh 
a,Tu(t)*Im(t) 
Bı + Tu(t)? 


F Cru,imTu(t)Im(t) 


=Im(t) + C3 — Cimalm(t) + 


a2Tu(t)Impy(t) 
b2 + Impy(t) 
— Cche,tmChe(t)Im(t). (3.8) 


Parameter elucidation of immune cells are outlined as Table 3.2. 


3.2.3 Drug Attenuation Model 


We assume that at some point after the injection of a chemotherapy drug, the con- 
centration of the drug in the body will decrease exponentially. To guarantee the 
effectiveness of the treatment, we add chemotherapy drugs to the body, simultane- 
ously. 


Che(t + 1) = Drcpe(t) — e`" Che(t). (3.9) 

Similarly, we can obtain the attenuation model of the immunoagents: 
Impy(t + 1) = Drm (t) — e” Imp (t). (3.10) 
where injected at t, Drcye(t) and Dr7;,,(t) denotes concentrations of the chemother- 


apy drugs and immunoagents separately. yı and ‘72 is the decay rates of the chemother- 
apy drugs and immunoagents. 
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3.2.4 The Design of the Optimization Problem 


Combined with the contents of (A), (B) and (C), we finally obtain the mathematical 
model affecting the growth of tumor cells: 


Tu(t +1) = Tu(t) + C,Tu(t)( — CoTu(t)) 
—Cim,ruT u(t)Im(t) — Cche,TuTu(t)Che(t) 
Im(t + 1) = Im(t) + C3 — Cim aIm(t) 


a Tu(t)?Im(t) a2T u(t)Impy(t) 
+ Bi+Tu(t)? + Ba+Im,,() (3.11) 


—Cru,imTu(t)Im(t) = Cche,tmChe(t)Im(t) 
Che(t + 1) = Drcne(t) — e`" Che(t) 
Impy(t + 1) = Drin(t) — e YP Impy(t). 


Given that Tu(t), [m(t) are biomass, and Che(t), Im py(t) are the drug concen- 
trations in the bloodstream, 


Tu(t), Im(t), Che(t), Impy(t) = 0, Vt > 0. (3.12) 
And all parameters in the model are non-negative: 


Ci; C3; C3; Cim,Tu; CChe,Tu; Cim,d; CTu,Im; CChe,Im 
a1; O23 B1; B2; V1; Yo = 0, Yt > 0. (3.13) 


When we qualitatively analyze the problem that how to minimize the residual 
tumor cell population in the bloodstream on the premise of using as few drugs as 
possible, including chemotherapy drugs and immunoagents. This process can be 
described as a quantitative mathematical expression as (3.14). 


Drche(t) _ p 
min{aTu(t)? + n f tanh—'(U;'s)U,Rids 
0 
Drm (t) 7 E 
+ v | tanh—'(U;'s)U2Rods}. (3.14) 
0 


It is emphasized here that the single dose of the two drugs should be limited to 
avoid drug poisoning. So we use a definition form with input constraints. During the 
whole treatment process, we can get: 
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iy Drche(t) B _ 
Dmtatuy? +b, f tanh—'(U;'s)U,Rids 
0 


t=to 
Drim(t) 7 z 
+ b> f tanh! (Ü; 's)ŪRds}, (3.15) 
0 


where0 < à< 1, U 1 and U: 2 represent the maximum permissible dose of chemother- 
apy drug and dose of immune agents in a single injection, respectively. 


3.3 The Proposed Nonzero-Sum Games-Based ADP 
Scheme 


To solve the given problems above, we propose an aggressive treatment plan or 
control scheme based on nonzero-sum games-based ADP algorithm. 


3.3.1 Theoretical Introduction 


For a differential control system x(t + 1) = F (x(t), u(t), t)), x(t) is the state vari- 
able, u(t) is the control variable, F is the transition mapping between states, and 
then the cost of state transition is obtained: U (x(t), u(t), t), and the total cost of the 
whole period is ¥;4, U (x(t), u(t), t). 

When solving a finite time problem, we can equivalent it to 


ye NU (x(t), u(t), t),0<A<1. (3.16) 


In the application of Bellman’s optimality principle to solve (3.1), we first stipulate 
J (x(to)) = an NU (x(t), u(t), t), and then we can obtain that 


J* a0) = min {U[x(t), u(t)] + AJ“ + DI}, t € (to, œ). (3.17) 


The corresponding optimal control can be solved and the form as follows. 


u*[x(t)] = arg min {U[x(t), u@)] + AJ* XC + DI}.t € o,00). (3.18) 
u(t) 


This typical solution approach is a considerable challenge for computing and 
storage space. 
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Remark 3.1 Adaptive dynamic programming as an optimize learning method is 
usually used to track the cost function, which is not only designed to minimize the 
tumor cells, but also minimum dose chemotherapy drugs and immunoagents in this 
chapter. 


3.3.2 Iterative ADP Algorithm 


To solve (1), we use an iterative adaptive dynamic programming algorithm, and the 
revised facilitate solving differential equations model. 


(1) Brief interpretation of ADP algorithm 

Firstly, we take a value function K (x) to approximate the cost function J (x). In this 
case, the purpose of iteration is to ensure that the approximate function approaches 
to the optimal value equation and obtain the optimal decision law. Namely, 


| KERAN (3.19) 
k > u“. 
Secondly, in the specific solution process: 
Give K°(-) = 0, we make 
K°(x(t)) = arg min {U[x(t), u(t] + AK° (x(t + 1))}, (3.20) 
u(t) 
and update the value function as 
K' (x(t) = U[x(@), 6° (x(t) ] +AK œl + D), (3.21) 
fori = 1,2,3,..., we can get 
«i (x(t)) = arg min {U[x (t), u(t)] + AK (x(t + 1))}. (3.22) 
u(t) 
and 
Kit! (x(t) = man {U[x@), u(t)]| + AK (x(t + 1))} : (3.23) 
Thus, 


KIH! (x(t) = Ux (t), ki] + AK (x(t + 1). (3.24) 
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and the optimal solution is obtained when the error requirement has 
been adequately satisfied as condition that K ‘(x(t)) > K*(x(t)) and 
[K+ (x(t) — Ki ælt + 1))|| < £, where i represents the number of iterations. 


Algorithm : EvolutionaryADPalgorithm 

Initialization : 

1. A certain initial state is given randomly in the feasible region x(t); 

2. Set A°(-) = 0; 

3. Specific parameters are given according to the requirements: error €, 
discount factor A; 

IterationandUpdate : 

4. i = 0, substitute x(t) into “(3.26) = 0 ”, yield Ki (t); 

5. Plug x(t) and kÍ (t) into (3.25),and to get x(t + 1); 

6. According to the (3.29), calculate A’+!(x(t)) = 2¥ oe O) 4 Ai(x(t + 1)); 
7. According to the data set [x(t), Atl), 

the neural network of the relationship between x ~ A ; 

8. Using the neural network obtained by “7.”, the value in the same state 
is calculated. When | ait! (x(t)) — Al (x(1)) | < e ends; 

If it is not true, returns “4.”; 


To faster convergence to the optimal solution, we update in each iteration and 
value function, the control law according to the current direction of steepest descent, 
that is, 


OK (x(t) OU), KHO) Ox(t + 1),,OK' x(t + 1) 


cee Me EE? Va 
aK't!(x(t)) _ OU (x(t), Kit (t)) p pea +1),,OKi(x(t+ 1)) (3.26) 
Ont) kit! (t) Oni+1(t) Ox(t+1) ` l 


Setting A'a + 1)) = gD. 


(2) Modification of Model (3.11) 

Compared with the traditional control strategy, we directly solve the problem pro- 
posed in this chapter by using ADP, although it is difficult to solve the model. Here, 
we propose a fitting idea to modify the model. Analysis on (3.11) shows that the 
injection of chemotherapy drugs into the body has a direct effect on tumor cells. 
On the other hand, immunoagents act on immune cells, which affects tumor cell 
populations. Throughout the whole action process, we can only consider the input of 
chemotherapy drugs and immunoagents at every moment as the two control inputs 
of the system and the state variables of the system are selected as the intermediate 
transition variables such as tumor cells and immune cell population. 

1. The standard expressions of control variables, state variables, cost functions and 
so on are given as follows, 


x(t) = Tu(t), ui (t) = Drene(t), u2(t) = Drim(t). (3.27) 
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o9 u(t) _ _ 
K(x) = SON fax(ty + f tanh! (U7 's)Ū,R; ds 
0 
t= 


u2(t) = B 
+ b> f tanh! (Ü; 's)Ū2R2 ds}. (3.28) 
0 


2. The modified system model adopts the form of nonlinear affine system, namely: 


xt + D = FEO) — E), pau A), u2. (3.29) 


3. Update the optimal control law and value function: 
OKH (x(t) OKH (x(t) 
Let Ou} (t) = 0 and Ou}, (t) H 0, 


ut* (t) = Ūitanh( naM El + 1))), (3.30) 


b,U,R, 


u5*(t) = Urtanh( DAMA (x + D). 3.3) 


byU2R> 
From this, we can also get 


OK'* (x(n) 


— alti = 
A = At! (x) =À 


df&œ)  ,dgi(x)  ; dgx) 
IG “a CEE. 

XxX xX 
- Al (x(t + 1)) + 2ax. (3.32) 


Remark 3.2 To approximate optimal value based on optimal decision law, value 
iteration method is devoted to tending to the cost function J (x) through value function 
K(x). 


Remark 3.3 The fitted curve is constructed according to date obtained from the 
original model which is uneasy to solve, and the modification of model is research 
objectives for replacement, considering control inputs as chemotherapy drugs and 
immunoagents, simultaneously. 


3.3.3 Convergence Analysis 


This section provides proof of the convergence of this algorithm to prove the effec- 
tiveness of the algorithm in theory. This proof is mainly derived from formulas (3.1), 
(3.2), and (3.3), including two lemmas and three theorems. 
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Lemma 3.4 Take a control sequence {Ar (x(t))}. When it is brought into for- 
mula (1), the corresponding value function Ji œ) is obtained. Compared with 
the control sequence {i (X(t))} corresponding to the minimum cost K'(X(t)). If 
JLO = KC) =0, J) = URC), Ari EA] + AJ), RCE + D), satisfying 


KFR) = URA, K EAMH EGH D) 


min [URA Ar()] + AK' Z + D)} (3.33) 


Then, Ji,(X(t)) > Ki(&() for Vi. 


Proof K'(X) is obtained by taking the minimum value equation Ji, (x). {K (X(0))} 
is the corresponding optimal control sequence. For the arbitrarily control sequence 


{ Ar’ (x(t))}, the value equation J ae (x) which is corresponding with the arbitrarily 
control sequence must not be less than Kİ (x). 


Lemma 3.5 Select a stable admissible control sequence {Sa (x(t))} with certain 
restrictions and the corresponding value equation J}, (¥). For controllable system, 


if JLO = K?) =0 and JM) = ULM, Sa EO) HAGE + D), Then 
Ji (&) is bounded. 


Proof 


Jit EQ) = URO, Sa EO] + Ifa EE + VD) 
= URO), Sa ROI + ULE), Sa’ Ee + YI 
+ JG (RE + 2)) 
= UIR (t), Sa’ I+ AURE), Sa’ E(t +. D) 
+ XUBE +2), Sa’ R(t + 2D) 4+... 
+N C+: + 1). (3.34) 


Thus, JEEE) = Fio NURE + j), Sa’ “C+ A) and JENRO) < 
lim; Xio YURE + j), Sa (&(t + j))], where {Saé (%)} is the stable allow- 


able control sequence, and we can get an conclusion that 0 < JX) < 


lim; >o Z} o NUL + j), Sa’ @@ + j)] < C for given constant C. That is, 
J} (¥) is bounded. 


Theorem 3.6 From formula (1), { ki (x(t))} is the control sequence corresponding 
to the minimum value function K' (X). Assuming the initial state K'(-) = 0, it can be 


proved that the sequence { ki (x(t))} is a monotonic non-decreasing sequence, and 
K'(X(t)) < KRO). 
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Proof Define a value equation T} (X (t)) : T'(-) = 0, T*' (4) = ATİ ($ (t + D) + 
U[x(t), PIHE E))]. When i = 0, TEE) = URE, FORO] + ATA + 1), 
TIEA) — TEO) = UE, POA) = 0, we get T1) = TRO). 

Assuming ¢=i-1, TRO) > TIEA), When t=i, T 
(x(t)) = U[x(t), EZA) + ATi x(t + 1) and Tit x(t) — TI) = 
AURA, ET + 1))]) > 0. Then THIRA) > TİZ). And we can get 
KİZA) < KRO). 


Theorem 3.7 It is known that { ki (x(t))} is the control sequence corresponding to 
the minimum cost function K' (x), which can prove lim; K'(X(t)) = K* (X(t). 


Proof {«'(X)}and Kİ (x) have been given in Lemma 3.2, and the corresponding value 
function of {«i!(x)} is K+ (X(t) = ULX(), R(X) + AK (XD), where / is 
the length. Obviously, K't!(x(t)) = Po A UJE C + j), KS (xt + j). 
After taking the limit, we can obtain K™!(X(t)) =limj.. Ya 
NURE + j), RI E+ j], and define K*(X(1)) = inf{K EH). 


Similarly, 2+! (x(t)) < K! (X (t)) < D5 can be obtained from Lemma3.5. 
On the other hand, we get K‘+!(x(t)) < KS (X (t)) based on Lemma3.4. There- 
fore, it can be concluded that K't!(x(t)) < RITHE) < RZ ()) < D’. 
K*(x(t)) = inf K®+(x(t)) with the definition of minimum value for the optimal 


value equation, extracting a control sequence {Ki} so that K°" < K*(x(t)) + €, 
and drawing an conclusion that K” < K*(x(t)) + e. Considering K'*!(x(t)) < 
KIHI (Z(t) < K®!(¥(t)) < D! in another way and taking the limit, the formula 
holds for any i, L, then lim;_,.. K‘(X(t)) = inf DS. 


To guarantee lim; o Kİ (X (t)) = K9 (3 (t)), the control sequence {x9} is nec- 
essary, and then we can get Kİ+!(¥(t)) > K*(X(t)). Combining both aspects above, 
lim;_, 5, K'(X(t)) = K*(X(f)) is obtained. 


Theorem 3.8 For any state variable X(t), the optimal value equation K' (X(t)) sat- 
isfies the characteristics of the HJB equation. 


K*(x(t)) = U[X (t), K®)] + AK* (X(t + 1). (3.35) 


Proof From the proved lemmas and theorems, a series of characteristics about 
“K'(x(t))” are obtained. At this time, it is necessary to verify that characteris- 
tics of the HJB equation are satisfied. According to (3.23), there exits K*(x(t)) = 
inf {Us (t), K] + AKİ (X(t + 1))}, meanwhile, according to Theorems 3.6 and 3.7, 


yield that K'*!(x(t)) = min{U[X (t), K] + AK'(X(t + 1))}. Then take the mathe- 
K(t) 
matical limit, we get K*(x(t)) < inf U[X (t), R] + \K‘ (X(t + 1) for the randomness 
R(t) 
of {u(t)}). 
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From the other side, we have K‘t!(x(t)) > inf U[x(t), R] + AK’ 1x + 1), 
K(t) 
take the limit again, then yield that K*(X(t)) > inf U[X (t), K] + AK‘ (x(t + 1)). As 


to the analysis above, we can get a final conclusion. 
K*(x(t)) = UIO), K()] + AK* Œ + 1)). (3.36) 
All content is verified. 


Remark 3.9 The control sequence [ki (x(t))} is a monotonic non-decreasing 
sequence corresponding to the minimum value function K'(x), which tend to be 
K*(x(t)) eventually, satisfying the characteristics of the HJB functions as [31]. 


3.4 Simulation and Numerical Experiments 


In this section, we consider the mechanism model of tumor cell growth combined 
with immunotherapy, chemotherapy and combination treatments proposed as experi- 
mental validation. Firstly, The affine system model is constructed with chemotherapy 
drugs and immunoagents as control inputs and the account involved of tumor cells 
as state variables. Secondly, according to the affine model obtained by fitting, we 
developed the cost function of treatment loss with the clinical treatment requirements. 
Finally, the optimal treatment plan for a patient with a basic condition is given after 
calculation by the algorithm. 


3.4.1 An Affine Model of Tumor Cell Growth 


According to clinical medical statistics and literature [4], the specific parameters of 
the mechanism model are given as Table 3.3. 

At this point, when we give the initial count of tumor cell population and immune 
cells in a patient and follow a certain chemotherapy and immunotherapy regimen, 
we can get the following four curves on tumor cells and immune cell population 
as shown in Figs. 3.1 and 3.2. It is obviously that state variable Tu(t) denoted the 
population of tumor cells tend to be stable in Fig. 3.2, similarly, for Jm(t) in Fig. 3.1. 

When the fitted affine system is carried out according to the data obtained from 
the mechanism model, Drcne(t) and Dr7,,(t) are selected as two control inputs and 
Im(t) as state variables. Within the allowable error range, the obtained fitting relation 
is shown as the following form, 
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Table 3.3 Concentration variation on immune cells, tumor cells, chemotherapeutic drug and 


immunoagents 


Parameter Estimated Units Parameter Estimated Units 
value value 

Ci 0.00431 day~! C2 1.02 x 107? | cell~! 
C3 0.33 cell- day™! | Crm, Tu 6.41 x 107!! | cell™! 
CChe Tu 0.08 day~! Cima 0.204 day 
Cru,im 3.42 x 1076 | cell“! CChe,Im 2x 107! day 
ay 0.0125 day~! a2 0.125 day 
By 2.02 x 107 cell Bo 2x 107 cell 
yı 0.1 day™! 12 1 day 

S 

T 

= 

3 

v 

S] 

5 

E 

= 

= 


Fig. 3.1 The curves of tumor cells 


Time (t) 


100 


xlt +D = FEA) EO), pau t), u. (3.37) 
f(x) = x + 0.00431x(1 — 1.02 x 107°x). (3.38) 

gı (x) = exp(8.15 x 10~°[log(x)]*!3! + 3.482). (3.39) 

ga (x) = exp(0.05639[log(x) 7"? + 2.492). (3.40) 


The curves before and after fitting are compared as Fig. 3.3, which meets the 
requirements of fitting precision, which guarantees accuracy of the data traced back 
to the original source. 
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Fig. 3.3 The curves of immunoagents drug concentrations in the bloodstream 
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3.4.2 The Treatment Loss Cost Function 


The form of the cost function proposed in the third part as (3.17). Unlike the the- 
oretical mechanism model analysis, and combined with clinical requirements, it is 
necessary to limit the single injection of drugs to no more than 0.05. Therefore, 


U; = 0.05, U> = 0.05. (3.41) 


To avoid the optimal solution in the infinite time dimension, we choose the dis- 
count factor \ = 0.95. Finally, the specifically obtained cost function as follows: 


00 u(t) 
K(x) = > 0.95' {2.784 x 107° x(t)? +f 50tanh! 
0 


t=fo 


u(t) 
(0.057!s)ds + f 850tanh~! (0.057! s)ds}. (3.42) 
0 


3.4.3. The Optimal Solution of the Treatment 


According to the previous two subsections, we have completed the transformation 
from the mathematical mechanism model to the solvable affine model, and deter- 
mined the specific value of the cost function according to the clinical requirements. 
The optimal treatment strategy is acquired through the proposed algorithm and make 
a comparison to prove the effectiveness and feasibility. The cost function is designed 
to minimize the tumor cells, meanwhile, there exit minimum dose chemotherapy 
drugs and immunoagents. 

In the following three figures (Figs. 3.4, 3.5 and 3.6), the blue curve represents 
the changes of tumor cells and the changes of a single dose in patients under the 
normal treatment regimen. In contrast, the red curve represents the optimal treatment 
regimen’s effect calculated by the nonzero-sum game-based ADP algorithm. 

As shown in Fig. 3.4, there are originally many cancer cells in the body. The two 
curves are close to the upper limit, with drugs and dual function of the immune 
system, a substantial reduction in the number of cancer cells. The amount of drug 
injection therapy hasn’t changed greatly during the process from beginning to end. 
Even in the closing stage, cancer cells decreased significantly, there are still specific 
doses, and we solve the treatment dose is substantially less than the former. 

Correspondingly, as shown in Fig. 3.5 that the changing trend of the injection dose 
of immunoagents on the two curves is close to the changing direction of chemotherapy 
drugs. The optimized treatment is slightly more than the traditional treatment plan 
when more cancer cells are in the initial stage, but it will not last for a long time. 
When the number of cancer cells is relatively large, the primary or indirect target of 
these two drugs is cancer cells; then, in the late stage of treatment, the number of 


3.4 Simulation and Numerical Experiments 


0.05 T T 


—— Without taking the optimal treatment 
Taking the optimal treatment 


0.04 F 


Dr che 


0.02 F 


0 i i i i 
0 20 40 60 80 100 


Time (t) 


Fig. 3.4 The injection dose curve of chemotherapy drugs under two kinds of treatment 
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Fig. 3.5 The injection dose curve of immunologic agents under two kinds of treatment 
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Fig. 3.6 The curves of tumor cells under two kinds of treatment 


cancer cells is significantly reduced. If the chemotherapy drugs are put in according 
to the normal treatment, the normal cells will suffer a lot of erosion, which has a 
more significant impact on the body. However, the optimized drug dose has been 
dramatically reduced, and the normal cells have been less affected. 

As shown in Fig. 3.6, control effect of the two treatment schemes on the number 
of tumor cells enjoy resemblance to that in the initial stage. Still, at the final stage, the 
algorithm optimized by ADP not only significantly reduces the count of tumor cell 
population, combined with Figs. 3.4 and 3.5, but also minimize the injection amount 
of the two drugs, which shows the effectiveness of our treatment scheme. 


Remark 3.10 The optimal regulation strategy for the immune system enjoys advan- 
tage of decreasing of tumor cells, what is more, clinical treatment benefits from 
typical minimization of chemotherapy drugs and immunoagents. 


3.5 Conclusion 


Nonzero-sum games-based adaptive dynamic programming has been proposed 
acquiring the optimum through affecting the growth of tumor and immune cells, 
providing guidance for clinical practice through adjusting the administered doses of 
chemotherapy drugs and immunotherapy drugs. Obtained results have shown that the 
immune system can decrease the tumor cells, meanwhile, minimizing of chemother- 
apy drugs and immunoagents through optimal control behavior. Simulation examples 
have presented availability and effectiveness of the research methodology. The future 
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research will focus on solving the optimal mixed treatment strategy taking account 
of complex immunotherapy system including immune cell subsets and cytokines, 
considering the switched control policies in according with hybrid therapy. 
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Chapter 4 A) 
Evolutionary Dynamics Optimal get 
Research-Oriented Tumor Immunity 
Architecture 


4.1 Introduction 


Interaction between cancer cells, surrounding stromal cells and immune cells through 
autonomous and non-autonomous signaling can influence survival competition. 
Therefore, it is very critical for evolutionary and ecological dynamics mechanis- 
tic understanding of tumor progression [1]. It is assumed that evolution causes traits 
to change continuously over time even if the ecological dynamics are constantly 
changing. More broadly, imagine an evolutionarily stable state that is a trajectory 
of phenotypic states-an evolutionarily stable trait attractor. This can be used in sce- 
narios where there is sufficient variation to facilitate rapid evolution, or where the 
state involves a plastic response to environmental conditions, eventually constituting 
evolutionary stability. Simultaneously, Natural killer (NK) cells as one of the players 
in the game attack many tumour cell lines, which is critical in anti-tumour immunity 
[2], however, the interaction between NK cells and tumour targets is poorly. To over- 
come drug resistance, anti-tumor immunotherapy gradually replaces the traditional 
treatment strategy [3]. The interaction between specialized cancer cell populations 
and immune cells has become a special evolutionary dynamics phenomenon in the 
process of tumor immunity growth architecture. The goal of optimization is to min- 
imize administration dosage and reduce negative effects. 

The dynamic perception or learning process is realized through interactions 
between cells and organism architecture, accomplishing observing their responses 
and learning optimal control strategy ultimately of Markov decision. It is required 
to seek an optimal control scheme such that the desired dosage of administration 
can be tracked and the optimal performance of minimize chemotherapeutic drugs 
and immunological agents can be achieved. Thus, reinforcement learning is urgently 
needed for optimal research-oriented tumor immunity architecture. The classical 
policy iteration and value iteration frameworks are never out of date, and the new 
min-Hamilton function [4] and the low-gain parameter ADP-Bellman equation for 
global stabilization are thriving [5]. 
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The interaction between cells is highly nonlinear and coupled. When the compu- 
tational conditions allow, whether it is the adaptive algorithm design based on policy 
iteration, or the adaptive hierarchical neural network algorithm [6], which can easily 
solve the coupled fractional order chaotic synchronization problem. All inspire us 
in solving the optimal solution of the HJB equation of the idea. Once computing 
conditions are not available, model-free is the best idea. The iteration-ADP algo- 
rithm is developed into iteration-NDP algorithm, which does not require an accurate 
system model [7], but only requires observable system data, which can reduce the 
cost and optimize the control action in the process of error backpropagation [8]. 
The emergence of Q-learning, from containing three classes four networks to inter- 
leaving double iteration, and then to the critical Q-learning [9] of a single class one 
network, effectively improves the utilization of resources, and the problem of insuf- 
ficient exploration no longer exists. The interaction between cells coincides with 
multiple agents, and the attack of tumor cells on normal cells may cause abnormal 
reactions, and the neural net-based attack detection and estimation scheme designed 
by [10] can easily capture such anomalies. Cells cannot proliferate without limit. 
When solving the optimal solution of the constrained auxiliary subsystem, based on 
the framework of ADP, the idea of pi iteration is continued, and a strong convergence 
synchronous iterative optimization strategy [11] is given. 

The difficult-to-decouple leaderer-follower behavior of vehicle-vehicle commu- 
nication [12], human-vehicle interaction, and mutual quality of everyone can be 
easily solved with off-PI [13]. Switching system [14], T-S fuzzy, nash equilibrium, 
zero-sum game [15], let each agent deal with a low-dimensional state and local pat- 
tern, reduce conservatism, can easily obtain the minimum local cost [16]. Influenced 
by the improved exploration feature, the parallel A-C asynchronous gradient shar- 
ing mechanism can realize the parallel optimization operation of diversified agents 
in a short time [17]. Affected by the time difference error, integral reinforcement 
learning can obtain the estimated control strategy by updating the critic weight [18, 
19]. In order to obtain a better stabilizing adaptive control scheme, it is necessary 
to give an appropriate robust control scheme for the control system [20]. Reference 
[21] summarizes the recent outstanding progress in the continuous nonlinear control 
system of the controller that combines adaptability and robustness. The reliability 
and effectiveness of the actual power system and some large machinery and heavy 
machinery devices with these two designs considered are also demonstrated. The 
theory integrates ecological and evolutionary dynamics blending ecological mathe- 
matical model evolutionary game theory [22]. Then evolutionarily stable strategies 
will be investigated to seamless integration of both sides [23]. Solvable dynamic 
equations can be used to explore optimal control objectives, however, what followed 
is a disaster of dimensions. 

To overcome it, dual-heuristic dynamic programming is proposed for the nonlinear 
affine evolutionary dynamic dated from ADP considering the actual constraints. By 
introducing a discounted performance index, the optimal regulation problem of the 
infinite dimensional problem is reformulated into a finite dimensional. Different from 
previous value iterations which requires a strategy for initially stable the system. ADP 
is conformed to the optimal formation control by the establishment of performance 
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index function [24]. The affine mathematical model is firstly introduced to twinborn 
the real scenario [25]. The optimal control is transformed into pursuing solution of 
HJB, and the convergence is proved. ADP involves learners giving rise to learning 
strategy, and the author studied a competitive learning system setting with cancer 
cell populations and immune cells, aiming at minimizing the dose administered. 


4.2 Pre-knowledge 


Consider a classical discrete-time nonlinear affine system, 


xE + 1) = H) + gaeut) (4.1) 


where the state variable x(t) € R”, the control variable u(t) € R”, and f(-) € 
R”, g(-) € R”*” can be stabilized on a compact set Q € R”, and f (0) = 0 g(0) = 0. 
Colloquially, the optimal control problem of (4.1) is equivalent to obtaining u* (t) = 
u(x(t))(the optimal control law) that minimizes the proposed infinite-horizon per- 
formance index: 


J(x(t)) = Y KEH, uO). (4.2) 


t=0 


K(x(¢), u(t)) is the cost function, K(x, u) > 0 Vx, u. Basically, the cost function 
K(-) is given a quadratic form 


K(x(£), U) = x" (Px(t) + u” HQu(t) (4.3) 


P, Q > 0 are all positive definite matrices. 

The optimal control problem of (4.2) can be converted to solve the HJB equation. 
According to the Bellman optimal principle, the optimal value function should obey 
the following[9]: 


J(x®) =min{x" OPx (O+ u’ (t)Qu(t)+ Txt + D} (4.4) 


By minimizing the right side of the (4.4) to solve the optimal control law, get the 
optimal value function J*(x(¢)). For necessity, one can take the partial derivative of 
the right-hand side of (4.4) with respect to u(t) to obtain u*. Hence, 


T roy" 1 
wo=-F [peo] So (45) 
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Take (4.5) into (4.4), it can be obtained that 


1, Os* 1 
FaH) =x" (HPX) + e 


j ] seo" 
|+reu+ Dp. (4.6) 


OS* (x(t + 1)) 


COL py 


By the on (4.6), it is almost impossible to obtain an analytical solution for u* (£). 
Impossible in the current moment ¢ can know the next moment J* (x(t + 1)). To over- 
come this dilemma, the approximate optimal solution of HJB equation can be studied. 
In the fourth part of this chapter, the derivation of IDHP algorithm is introduced to 
solve this kind of optimal control problem [26, 27]. 


4.3 Modeling of Mixed Immunotherapy 
and Chemotherapy for Tumor Cell 


In this part, a mathematical model is constructed from the natural growth of a single 
type of tumor cells, the gradual increase of the interaction between various immune 
cells and tumor cells in vivo, and the influence of external application of chemother- 
apy drugs and immune agents on the population of tumor cells [22, 28, 29]. 

First, define the acronyms of various cells: 


J, (t): Tumor cell population in the vivo. 

Nx (t): NK cells are derived from bone marrow lymphoid stem cells. 

Cy (t): Cytotoxic T lymphocytes (CTL), a subdivision of leukocytes, are specific 
T cells that secrete various cytokines and participate in immune function. 

C(t): Number of circulating lymphocytes (or leukocytes). 

Chg, (t): Chemotherapeutic drug concentration in the blood. 

ama (t): Immunotherapy drug concentration in the blood. 


For the convenience of writing, the following subsections do not specify the time, 
and the default is +, lowercase letters “a, b, c1, c2, €, f, g, 61, 62,13, M, ni, n2, pr, 
P2, q1, q2, €, 5, U” all represent fixed real numbers; Uppercase letters “G, K, O, R, I” 
represent different categories of gain items, which depend on time ¢; £(.) is a constant 
that depends on the cell type; and e stands for exponential functions. 


4.3.1 The Natural Growth of Cells 


According to the [2, 22], the increase of tumor cells follows a natural growth curve, 
Gs, = aF,(1 — ba) (Qo represents the natural growth tescr operator of all types 
of cells). Natural killer cells [22] are assumed to be produced at a constant rate 
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and to be influenced by circulating lymphocytes throughout the production cycle 
(since circulating lymphocytes represent the overall level of immune health), and 
thus,GQn,, = ¢1C¢ — c2. In the absence of tumor cells, Cytotoxic T lymphocytes 
are assumed to be absent and cell growth of Cy (£) cells is only affected by natural 
mortality, Ge, = —eCz. Circulating lymphocytes are also produced at a constant 
rate during their lifetime, Çe, = f — gC. It is set that when the body is injected 
with chemotherapy drugs or immune agents, it will show exponential decay, Ge,,, = 
—e Cha, Camp = —e Imy,. 


4.3.2 Intercellular Conditioning 


When the above cells exist at the same time, there will be a negative interaction 
between the two populations, partly due to the competition for growth space and 
nutrients, and this indirect effect. The other part is the direct resistance of cell pop- 
ulations to each other [22]: 


(Cz /Fu)" 


K = jn I, Ke, = eae o aaa 7? 
Se le: ae ar fae 


And just to simplify the writing, let’s write © for a particular term, and notice 
that © = O(t), which is related to Cz (t), Ta (£). 


(Cz /Fu)" 


O = . ——— 
Di C/T 


Ke, = 0 -Jy (4.7) 


NK cells have the function of recruitment, which is to design sequential application 
methods of cell cycle non-specific drugs and cell cycle specific drugs, recruit more 
cells at specific stages into the proliferation cycle, so as to increase the number of 
tumor cells killed [29-31]. 


[. A Og, 
Ns Re (Su, Cr) = p1 


Rn, = ae 
ue MI? qit OR? 7 


Cy cells have a similar recruitment effect [32]. It is directly proportional to the 
number of cells killed by NK cell lysis of tumor cells, Re, Ng, Ta) = MNg Ta. 
Also, the presence of tumor cells stimulates the immune system to secrete more 
cells, Re, (Cre, Ju) = n2Cc Ta. In the immune function, NK cells or CD cells may 
have to undergo multiple contact with tumor cells, and then inactivate [29, 33-35]. 


Gacy = —P2TWMG Jaces =—-NCtIn Jere; = Ng (Czy 
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4.3.3 Drug Intervention 


All kinds of cell populations in this model contain the action tescr of chemotherapy 
drugs, and the killing effect of chemotherapy drugs is not always effective. At low 
drug concentration, the killing rate increases almost linearly, while at high drug 
concentration, the killing rate tends to be stable. Saturation type is used to describe 
them in the model [36], 1 — e“*®, 


De) = Lod = EO) 


(-) = Tus Cr, Cy, Ny. 

L represents the interaction coefficient between corresponding cells and tumor 
cells. It also includes immunotherapy, whose impact on immune system efficacy 
can be mathematically described by the Michaelis-Menten interaction, $, u are the 
constant [30]. 

Ging, Cz 


DI" (Cy, Img) = u —— 
r (Cr, Sma) s + my 


Chemotherapy and immunotherapy drugs are injected in a certain period of time, 
and denote by Vcne(t) and Vrm (£) the amount of chemotherapy drug injection and 
the amount of immunotherapy drug injection, respectively. 


4.3.4 Mixed Growth Model of Cell Population 


Combined with the above contents, the total cell population growth model can be 
obtained: 


malt +1) = (1 — 7) Img, (t) + Vim E) (4.8) 
Chea(t +1) = (1—e*)Cha(t) + Uche(®) (4.8b) 
Celt +1) =f- Le, + -MCc® - Le, (4.8c) 
Tut + 1) = (+a LIRE) — 65,7) 
+ THe -mH - oe] (4.80) 
Cg (t + 1) = (1 — e — Le, )Cr (£) + [iM ) — q2€z (t) 
+ MELON KO) — MACE) + Le, Cr O (4.8e) 


ugma (t) ORON MAC) ] 

s+ Im q + OOW O 
LRO 

m+ F(t) 


+ [Line = poIu(O [Me + CeO) (4.88) 


eO 4 eol 


NiE +1) = —Lr, + (1 — cN) + Ny (t) 
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4.4 Iterative-Dual Heuristic Dynamic Programming 
Algorithm for Mixed Treatment 


The optimal control problem has been transformed into solving the HJB equation 
(4.4). In this part, a constrained iterative dual heuristic dynamic programming algo- 
rithm based on mixed treatment is given. The algorithm is derived from adaptive 
dynamic programming [26]. This part mainly three parts research content are pre- 
sented as working mechanism of ADP algorithm, structure of constrained iterative 
dual-heuristic dynamic programming algorithm and proof of convergence on I-DHP 
algorithm. 


4.4.1 Working Mechanism of ADP Algorithm 


Generally speaking, for unconstrained control problems, the performance functional 
(4.3) is usually chosen as the quadratic form. In this chapter, considering the actual 
constraints, is transformed into solving a bounded control problem, adopted a non- 
quadratic functional as follows: 


u(t) 
Y(t) = x"(¢)Px(t) +2 f tanh” U 's)UQds 
0 


It is convenient for mathematical calculation avoiding the loop or unlimited create 
unlimited returns markov decision process. In the loop or unlimited markov process 
which will constantly get reward again and again, so we need to add discount factor 
to avoid infinity and infinitesimal value function,By introducing discount factor , 
an infinite dimensional problem is transformed into a finite dimensional problem, 
O<A<l1. 


IO = X AYO, uD) = YO 
l=t 


+A > NPY (x (1), uD) (4.9) 


l=t+1 


According to the Bellman optimality principle, the optimal value function satis- 
fies: 


u(t) a 
Je) = min |x" @)Px(0) +2 f tanh” (U's): 
u(t 0 


UQds + A(x(t + D). (4.10) 


60 4 Evolutionary Dynamics Optimal Research-Oriented ... 


In the ADP algorithm structure, it iterates according to the policy iteration, select- 
ing T’ (x) as the approximation function and ø' (x) as the corresponding control law. 


The whole iterative process is as follows: 


1. Let the initial value function be T°(-) = 0 (which is far from optimal) and compute 


the control law at “+. = 0” as follows. 


u(t) 
gaH) =arg min{x"()Px () +2 f tanh-"(U's) 
u(t) 0 
-UQds + AT(x(t + D} 


2. Get T'(œ(5): 


aH) 
T(x(t)) =x (OPx(t) +2 f tanh"(U SNU 
0 
. Qds + AT(x(t + 1)). 


3. And for ¿ = 1, 2,3,--- 


u(t) 
@(x(t)) =arg min] x"(QPx (t) +2 f tanh-"(U s) 
u(t) 0 
-UQds + AT (x(t + h 


4. The iterative value function is obtained as follows: 


aW) 7 
TA) =x? (Px) +2 f tanh "(U s) 
0 


UQds + AT(x(t + 1)). 


4.4.2 Structure of Constrained Iterative Dual-Heuristic 
Dynamic Programming Algorithm 


(4.11) 


(4.12) 


(4.13) 


(4.14) 


In the dual heuristic dynamic programming, the assumption is that the value function 
is smooth, modelled on the (4.5), the partial derivatives (4.14) on the right side of 


9(x(t)), can get [37]: 


ITD) dfx (DPx ESIA tanh" 's)UQds| 


Ou(t) Ou(t) 
OT@E+)) _ 4 
Ou(t) Eá 
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And, for. = 0, 1, 2,--- 


i o —AàÀ rôx(t+1)7: ôTx(t + 1)) 
aA) = Tanh( 5 | a | ack ) (4.15) 
Do the same with (4.14) respect to x(t), 
OT a) Ox(t + 1)77 OT(x(t + 1)) 
ax(t) Pre) + Af ax(t) | ax(t +1) cn) 
As can be seen in (4.15) and (4.16), both have OE) compared to T’(x (t)) 


Ox(t + 1) 
in (4.14), DHP algorithm evaluates and updates the first partial derivative of the value 
function. 
The specific algorithm structure is as follows: (set costate function C’(x(f)) = 


OT"(x(t))/Ox (0). 


4.4.3 Proof of Convergence on I-DHP Algorithm 


The convergence proof of the algorithm shows that with the increase of the number 
of iterations, the evaluation and update between (4.15) and (4.16) are continuously 
completed, and the termination condition can finally be satisfied and the optimal 
solution can be obtained. 

The corresponding lemma needs to be given before the formal theorem proving. 


In order to facilitate writing, abbreviated “2 ti tanh? (UW 'sYUQds” to “H(u(f))”. 


Lemma 4.1 Assume that ¢'(t) is the control sequence calculated by (4.13), T' (x) is 
the value function calculated by (4.14). \'(t) is any admissible control sequence in 
the domain, and SX (x) is its corresponding value function equation, 


UAA =x" AOPE HCE) + AQ + 1)) (4.17) 


and it is easy to obtain: 


Tf QY-) = TX) = 0, then 0 < T(x) < Q(x), Ve. 


Proof The conclusion is obvious. T'(x) is the minimum value that can be obtained 
on the right side of (4.14), and g‘(f) is the corresponding control sequence. And 
Q'(x) is any admissible value function, so it must be not less than T' (x). | 


Lemma 4.2 Given that T'(x) by the (4.14), and if the system is controlled, then 
T(x) has an upper bounded 3 (a constant). 


0<T(x) <3,v 
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Algorithm 1: Procedure of the I-DHP algorithm: 


INITIAL: 

1. Select a smaller positive number e, initial iteration index for ¿ = 0, C%-) = 0. 
CALCULATION: 

2. Calculate the control law at the Oth iteration: 


u(t) 


9 (x(t) =arg min x7 (Px (t) +2 f tanh-"(U's)UQds + AT + D) 
u(t) 0 


—À Ee 1) 


= Ttanh( UOL ou 


Too 
| Ca@e+ 0) 
3. Update the costate function for iteration 1: 


Ox(t + 1) 
Ox(t) 


CLEO) = 2Px(t) 4 Al Jee+ D) 


4. Similarly, the control law for the e iteration: 


u(t) 
9(x(t)) =arg min x" (Px E) + 2 | tanh U sUQds + AT(x(t + D) 
u(t) 0 
=) Es +1) 


= Utanh( le 


a Jce C+D) 


5. Obtain the costate function for iteration + + 1: 


Ox(t +1) 
Ox(t) 


CHAE) = 2Px(t) 4 Al Jct 


COMPARATION: 

6. If | CHAE) — CaA) | < €, stop and get the approximate optimal control law 
9'(x(t)); 

Else, let ¿ = ų + 1, and jump to the 4. 


Proof Set »'(t) to be an admissible and stabilizing control sequence and V’'(x) to be: 
V(x) =x" (APx(H) + HO E) + AVE + 1)) 


Then, it can be obtained: (V-) = T'(-) = 0) 
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V(x) =x" HPE) + Ho) + WEED) 
=x" (HPX (£) +H E) + NEK +1)P 
x(t + DHO E+ yfexvree 9) 


=x" (t)Px(t) + Ho) + NEK + DP 
E+ DHHG E+ D)|+ 


+ [e + 1)Px(t + 1) + HOE + d| 
+ATVYx(t + + 1)). 


VHE) = Djo Ak E +DPxE +D + HOE +D)] < limo f Dio Xe" + 
DPx(t +1) + HOE + D) |} 


Due to the admissible control sequence v(t), it has an upper bound 3 that 


(x) < lim | yon EG + DPx(t + DHH (E+ D)| | <3. 
l=0 


Combined with Lemma 4.1, it can be obtained the result. E 


Theorem 4.1 For the iterative cost function T' (x) which follows (4.14) and its cor- 
responding control law ø' (t) obtained by (4.13), it can be concluded that with the 
increase of the number of iterations, T' (x) will converge to the optimal value function 
and ø' (t) will converge to the optimal control law, i.e., T' (x) > J* (x), Ø (t) > u*(t). 


Proof From Lemma4.1, Q(x (¢)) is the cost function corresponding to an any admis- 
sible control sequence ! (£), with 2°(-) = 0. 
Firstly, ¿ = 0, 


Ta) — VEE) = x" OPx@) + H("@) = 0 


then, T(x (£)) > QY(x(t)), ı = 0. 
Secondly, for ¿ — 1, given T(x(t)) > Q7 (x(t)), Vx(t). Then, as ų, it can be able 
to conclude that 


T OČA- TAE) = x OPH HHCH) 
+A(TEæE + D) — a+ 1) 
> A(T@E + 1)) — QV" + 1))) (4.18) 
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By the mathematical induction, it can be obtained that T*'(x(t)) > Q(x (£)), Ve. 
Combined with Lemma4.1, it is obviously concluded that T’*\(x(4)) > Q(x(f)) = 


T(x(4)), that is, [Ta wo} is a non-decreasing sequence, Vv. 


From Lemma4.2, the sequence }T(x(t))} is bounded to 3, which is equiv- 
alent to that the iterative equation has a limit value, which can be expressed 
as lim, >o T(x(t)) = T*(x(t)). Therefore, it is bold to assume that T*(x(t)) = 
min| x" (Px( + H(o@(t)) + ATEO) This assumption will be proved below. 


According to (4.14), 
TAE) < x @)Px(t) + HOE) + AT xt ++ 1). (4.19) 
From the non-decreasing property of sequence [Ta ©, it can be known that 
Tœ) < TOH) Ve. 
Substitute it into (4.19), 
TAE) < x @®Px(t) + HOE) + ATX +1), Ve. (4.20) 
(4.20) for any ų was established, that when ¿ = oo, also meet. 
T (x(t) < x (HPx(t) + H@(L)) + AT X(E + 1)), Ve. (4.21) 


Considering that @(t) is any given control sequence, (4.21) can further obtain: 


T*(x(0) < min] x”@®Px() + H(@(t)) +AT + D}, v. (4.22) 


With (4.14), Ta) = min| x TOPLA +H(@(t)) FAT t + 1} Me 
At this time of its on the left, and as a result of [Ta wo) non decreasing, get, 
T(x (£)) >minfx x" (t)Px (t) +H(@(t)) +AT (x(t + D}. Similarly, let 1 > oo, 
T*(x(t)) > minfa OP + H(@(t)) + AT(x(¢ + D}, Vo. (4.23) 
Combining (4.22) and (4.23), it follows that, 
T(x (t)) =min|x" (Px (£) + HØ) + AT (Et + 1» , Ve. (4.24) 
Can be seen from (4.24), the previous assumption proved how. Can be learned from 


Theorem 4.1, T*(x(£)) is a discrete-time time solution of HJB equation. Considering 
the uniqueness of the solution of the discrete-time-time HJB equation, it means 
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that T*(x(¢)) in (4.24) and Jx (Ł)) in (4.10) are the same solution. In other words, 
lim, >o T(x (4) = TEE) = Sa). a 


Theorem 4.1 proves that T*(x(¢)) in (4.24) and J*(x(£)) in (4.24) in (4.10) are the 
same solution of the HJB equation corresponding to the same cost function, while 
the termination criterion “||T’+!(x(£)) — T’(x(d) | < e” indicates that the optimal 
control law can be solved in finite time, and Theorem 4.2 will explain this context. 


Theorem 4.2 The system (4.1) is controllable and the initial state x(t) of the system 
can be chosen arbitrarily. Under the finite iteration index t, the iterative approximate 
cost function and the optimal cost function ||T* (x(t)) — T’(x(4))|| < € are equivalent 
to the termination criterion |T! (x) — T’(x()) || < €. 


Proof In Theorem 4.1, it is mentioned that [Ta ©} is a non-decreasing sequence, 
that is 
Fx) = Tx) = TEE) = TO). (4.25) 


If ||T*(x()) — T(x) < €, it can be concluded that 
T ŒH) = T'E) < e, TAE) < T'E) +e. (4.26) 
Combined (4.26) with (4.25), 
TŒ) < T EE) < TE) < TTE) +e 
> TOE) < TOE) < TOH) +e. (4.27) 


It can get that, 
[TH EM- TEO] <« (4.28) 


From a different perspective, if (4.28) holds and the li ©} is nondecreasing, 


= E+ TT EH) < TO) < TEO) =F). (4.29) 
It is obvious that T+! (x(£)) — T* (x (£) < €, 
[TH E) -TEO <e. (4.30) 


Based on the analysis of both sides, it can be concluded that ||T*(x(£))— 
TOI < € | TAO) -T@@)| < €. 


The two theorems deal with value functions T(x(¢)), while Algorithm 1 deals 
with costate function C(x (f)). It will be shown in Theorem 4.3 that this convergence 
is equivalent. 
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Theorem 4.3 (4.14) defines the sequence of value functions. The control law 
sequence is shown in (4.13) and the update cofunction sequence is shown in 
(4.16). The optimal value is chosen as the limit of the costate function C*(x(t)) = 
lim,-,o0C’(x(t)), and when the value function approaches the optimal value, the 
sequence of costate functions converges with the sequence of the control law. 


Proof In Theorems 4.1 and 4.2, itis shown that T* (x (¢)) and T” (x (t)) satisfy the cor- 
responding HJB equation respectively. i.e., T*(x(t)) =T(x()) = min|.x! (t)Px(t) + 
o(t 


HOO) ATOE D). 


Therefore, it can be concluded that the sequence ;T(x(¢)) + of value functions 
converges to the optimal value function of the discrete-time-time HJB equation. 
i.e.,T’ > T*, ast —> oo. 

Given C’(x(t)) = OT"x(t))/Ox(t). It is also possible that the corresponding 
sequence {cw wo) of costate function converges to C'— C* as 1 > oo. Due to 
the association, costate function is convergent, at the same time, it is concluded 
that the corresponding sequence converges to the optimal control law ø'—> ø* as 
L> oO. | 


4.5 Multi-factor Mixed Optimization Experiment 
Treatment of Tumor Cells 


This section explores a novel therapeutic intervention for tumor cell growth inhibi- 
tion. A discrete-time affine control system has been constructed from the multi-factor 
tumor cell growth model, and the iterative DHP algorithm has been applied to realize 
the reduction of drug dosage under the condition of greatly inhibiting the proliferation 
of tumor cell population. 


4.5.1 Discrete Affine Model of Tumor Cell Growth 


According to clinical medical statistics and literature [2, 30, 31, 38-41], the values 
of each parameter in the tumor cell proliferation model affected by multiple factors 
are shown in Table 4.1. 

Using these parameters, try to observe the tumor cell proliferation model given 
some circumstances. 

With reference to [1], the initial “7, (0) = 2 x 10’, 2,(0) = 1 x 10°, Cz (0) = 
10, C(0) = 6 x 10° ” was selected, and the chemotherapy drug at a dose of 
VUcne(t) = 3.5 was injected every 5 days in (4.8) to observe the changes of various 
cells in the current body. 
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Table 4.1 Estimated parameter values 
Parameter Description Estimated 
value 

a Tumor cell growth rate 4.31x 107! 

b The inverse of the carrying capacity of the cell. 1.02 x 107° 

c The percentage of circulating lymphocytes that become NK 2.08 x 1077 
cells 

c2 NK cell mortality 4.12 x 107° 

e Cy cell mortality 2.04 x 107! 

f Constant source of circulating lymphocytes 7.5.x 108 

g The natural death and differentiation of circulating lymphocyte | 1.2 x 107° 

bı Saturation level of fractional tumor cell kill by Cy cells 2.34 

i The killed index of some tumor cells by Cz cells 2.09 

j Fractional (non)-ligand-transduced tumor cell killed by NK 6.41 x 107"! 
cells 

f The highest recruitment rate of NK cells in tumor cells 1.25x 107 

m Steepness of the NK cell recruitment curve 2.02 x 10’ 

ny The rate at which NK cells stimulate Cz cellproduction after 1.1.x 1077 
killing tumor cells 

m The rate at which stimulate Cz cell production after killing 6.5x 107" 
tumor cells 

pı Maximum Cy cell recruitment rate 2.49 x 107? 

p2 NK cell inactivation rate by tumor cells 1x10 

qı Steepness of the Cy cell recruitment curve 3.66 x 107 

q2 Cz cell inactivation rate by tumor cells 1.42 x 10% 

t Regulation of Cz cells by NK cells 3x107" 
Steepness of the Cy cell recruitment curve 3.66 x 10” 

u Regulatory function by NK cells of Cz cells 3x107" 

LG, Fractional tumor cells are killed by chemotherapy 9x10"! 

Ln; Fractional NK cells are killed by chemotherapy 6x10"! 

Lez Fractional Cy cells are killed by chemotherapy 6x10"! 

Le, Fractional Cp cells are killed by chemotherapy 6x10"! 

Ya Rate of chemotherapy drug decay 9x107 

YB Rate of IL-2 drug decay 1x10' 


Figure 4.1 shows an injection method of chemotherapy drugs in the form of pulse. 
The drug is injected into the body to study the influence of the addition of chemother- 
apy drugs on the number of various cell populations in the body at different times. 
As can be seen from the curve of tumor cell change in Fig. 4.1a (the second curve), 
a dose of 5 chemotherapy drug injected every 5 days for 60 days is sufficient to 
control the proliferation of tumor cells. The four curves showed different forms of 
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(b) 


Fig. 4.1 Ten doses of chemotherapy over 60 days has been sufficient to eliminate the tumor. a 
Curves of the population of the four cell species. b Distribution of 10 doses of chemotherapy drugs 
within 60 days and the trend of changes in the concentration of chemotherapy drugs in vivo 
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oscillatory changes in the early stage, which mainly depended on the pulse injection 
of chemotherapy drugs, and immunospecific cell Cz, which also decreased to stabil- 
ity after tumor cells stabilized in the later stage. Figure 4. 1b shows the corresponding 
mode of administration, with the red is the pulse of administration and the green is 
the change of the corresponding chemotherapy drug in the body. 


4.5.2 Construction of Affine Model 


In (4.8), although the discrete model has been obtained, it is too complex and the 
addition of various coupling forms, which is difficult to be directly combined into the 
iteration-DHP structure. At this time, the idea of constructing a simple affine model 
is introduced. It can be easily learned from the above two sub-parts, which can be 
simplified as the influence of the injected concentrations of the two drugs on tumor 
cells in the body. Then, the current concentration of tumor cells can be selected as 
the state variable, and the injected concentrations of the two drugs (chemotherapy 
drugs and immune agents) can be used as the control variable to form a data set, 
starting from a large number of random data. The desired affine discrete model is 
obtained by fitting. 


x(t-+ 1) = f(x) + Beal u(t) (4.31) 


gi (log,o(x)) = 0.001771 (10g19(x)  —0.02931 (10819x) +0.1793 (10810) } 


—0.5353(log;o(x) J+ 1.741 (Iogio(x)) —1.133 (4.32) 


PA (log,o(x)) = 0.007579 (logio(x))'-0.1087(logio(x)) 


+0.4838(log,o(x))-+0.1783 (log, ox)’ -0.2304 (4.33) 


4.5.3 Optimization of Mixed Treatment Regimen 


Following the affine model mentioned above, it is necessary to specify the cost 
function required in iteration-DHP before optimizing the treatment: 
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Table 4.2 Default parameters 


A P mı U Uy m2 Uy Q 
0.95 1077 0.55 1.28 6 x 1048 |0.62 1.12 2 x 10° 
5 
3.5 X10 i l i 


The iteration error 


0 10 20 30 40 50 60 70 


The number of iterations 


Fig. 4.2 The iteration error change curve, after the end of the 67th iteration, satisfies the termination 
condition 


oe u(t) 
J(x(t)) = > AORO +m f tanh” (U, s) 
0 


1=0 


_ uz(t) ou 
ThQuds-+ms | tanh T's)TQeds}. (4.34) 
0 


According to clinical experience, the default parameters are shown in Table 4.2. 
The iteration error € is set to 1076, and the iteration error variation curve is shown 
in Fig.4.2. The error decreases extremely fast in the first twenty iterations of the 
calculation, and the convergence rate gradually decreases after 20 iterations. At 
u = 67, the termination condition has been satisfied. 

Analysis of tumor cells after meet the termination criterion, according to the 
optimized regimen of population change curve as shown in Fig.4.3, visible at an 
extremely rapid rate by the growth of stem. The usage and dosage of two drugs 
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Fig. 4.3 Tumor cell population changes in optimized treatment 


are shown in Fig. 4.4. Figure 4.4a represents the curve of injected concentration of 
chemotherapy drugs, and Fig. 4.4b represents the curve of injected concentration of 
immune drugs. 


4.6 Conclusion 


In this chapter, a tumor immune differential game system has been established to solve 
the problem of optimal clinical tumor treatment oriented to evolutionary dynamics. 
Firstly, a mathematical model of the game system between tumor cells and immune 
cells treated by immune agents and chemotherapy drugs has been given. Secondly, the 
bounded optimal control problem has been solved by the HJB equation with infinite 
horizon performance index which is subjected to practical constraints. Finally, the 
optimal iterative approximate control strategy has been obtained by the iterative dual 
heuristic dynamic programming algorithm, and the effectiveness of the proposed 
algorithm has been proved. 
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Fig. 4.4 Optimization of treatment of different drugs usage and dosage. a Injection concentration 
of chemotherapeutic drugs, b Injection concentration of immune agents 
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Chapter 5 A) 
N-Level Hierarchy-Based Optimal crest 
Control to Develop Therapeutic 

Strategies for Ecological Evolutionary 
Dynamics Systems 


5.1 Introduction 


The death toll from tumor diseases is on the rise, and the nonlinear dynamics and con- 
trol of tumor growth have attracted widespread attention [1]. The number of tumor 
cells is gradually increasing. The most obvious feature is abnormal anti-growth sig- 
nals. There is a strict control mechanism for normal cells. However, in the continuous 
process, the static and death signals are turned off to generate cell division signals, 
which leads to the crazy growth of tumor cells [2, 3]. Tumor cells promote the growth 
of blood vessels, which are necessary to provide nutrients. This is why the flow of 
blood in tumor tissues is related to the benign or malignant tumor. Cancer cells are 
also polarized. They have evolved their camouflage ability in the ongoing battle with 
immune cells, causing the immune system to mistake them for normal cells, which 
makes it difficult for chemotherapeutic drugs to distinguish the volume of biological 
targets [4, 5]. When the differentiation process of normal cells is not controlled, they 
will evolve into tumor cells. This is the nature of tumor cells, tumor cells continue 
to proliferate, deprive their limited body energy supply, and ultimately destroy the 
body’s function and die [6]. Therefore, in order to inhibit the growth of tumor cells, 
it is urgent to find a treatment that will minimize the damage to oneself. 

In the fight against cancer, before the advent of chemotherapy and radiother- 
apy, there have been no effective measures for the small differences between cancer 
cells and normal cells [7, 8]. When the side effects of radiotherapy and chemother- 
apy increased and the targeted therapy was highly targeted and inflexible, scientific 
research projects began to turn to humans themselves [9]. The complex and unique 
communities of cell life are called microenvironments by scientists. The microen- 
vironment has many characteristics that affect cell growth, behavior, and how to 
communicate with other cells nearby [10]. In the oncology world, researchers are 
committed to understanding the tumor microenvironment and trying to find feasible 
treatment opportunities. Under normal circumstances, the immune system can rec- 
ognize and eliminate tumor cells in the tumor microenvironment. However, in order 
to survive and grow, tumor cells can adopt different strategies to suppress the body’s 
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immune system and fail to kill tumor cells normally, thereby surviving the various 
stages of the anti-tumor immune response. The above-mentioned characteristics of 
tumor cells are called immune escape. Tumor cells escape the immune system, not 
because the immune system cannot recognize them, nor because it is not activated, 
but cancer cells have evolved a way to prevent T cell activation through specific 
binding [11-13]. Therefore, the medical community has been trying to find many 
special methods to treat cancer cells to block the activation of T cells and release the 
immune system. 

Chemotherapy not only kills rapidly differentiated tumor cells, but also involves 
conventional cells. Its side effects are the most obvious, but they can be alleviated 
by immunotherapy. The closure of immune checkpoints and the success of adoptive 
cell therapy have made immunotherapy a mature means of treating cancer [14, 15]. 
Compared with traditional therapies such as surgery, radiotherapy, and chemother- 
apy, immunotherapy has fewer side effects and better effects, but immunotherapy is 
difficult to overcome its transient nature. With the rapid increase in tumor patients, 
immunotherapy is rapidly emerging for the treatment of specific types of cancer, espe- 
cially tumors with poor immunogenicity [2]. The original intention of immunother- 
apy is to fight cancer cells through the lethality of immune cells themselves. As a 
typical immune deficiency syndrome, AIDS is caused by the failure of the immune 
response and is often attributed to the weakening of the immune level. However, once 
the activated immune system cannot be stopped, cytokines are produced, which is 
considered to be an overreaction of the immune system like COVID-19 [16, 17]. 
Therefore, the combined treatment of chemotherapy and immunotherapy is more 
reasonable. Immunotherapy refers to a treatment method that artificially enhances or 
suppresses the body’s immune function to achieve the purpose of curing diseases by 
referring to the body’s low or hyperimmune state. There are many immunotherapy 
methods, which are suitable for the treatment of many diseases. Tumor immunother- 
apy aims to activate the human immune system, relying on its own immune function 
to kill cancer cells and tumor tissues [18]. Unlike previous surgery, chemotherapy, 
radiotherapy, and targeted therapy, the target of immunotherapy is not tumor cells and 
tissues, but the body’s own immune system [19]. Different types of tumor cells inter- 
act with different types of immune cells, and these immune cells have the function 
of helping or attacking tumors [20]. 

The mechanism of immune regulation varies from person to person, but in the case 
of special calls, the optimal regulation based on immunotherapy will play a role in 
reducing tumor cells regardless of specific circumstances. Enhancing tumor antigen 
presentation can effectively stimulate dendritic cells and improve immunotherapeu- 
tic efficacy [21, 22]. The known “predation-prey” between immune cells and tumor 
cells will cause periodic growth and reduction of cells. This growth and reduction 
can continue indefinitely or reach a balanced saddle point determined by system 
parameters [23]. And all of the above is composed of a complex non-linear struc- 
ture, and it is difficult to achieve the global optimum with conventional optimization 
methods. Especially for the treatment of the human body, how to rationally use drugs 
to achieve the minimum harm to the human body is particularly important. So this 
article proposes a novel evolutionary calculation method, N-Level Hierarchy Opti- 
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mization (NLHO) algorithm. It is bionic from the hierarchical system of biological 
populations in the natural world. The hierarchical system refers to the hierarchical 
phenomenon in which the status of each animal in the animal group has a certain 
order. The basis of the formation of the hierarchy is the dominance behavior, that is, 
the “domination-submission” relationship [24]. When the formed hierarchical sys- 
tem stabilizes, lower-ranking people generally show compromise and obedience, but 
sometimes they also re-struggle to change the hierarchical order, and so on. A stable 
population will develop for a long time. This is an explanation for the rationality 
of the hierarchy preserved in evolutionary selection [25]. So for the entire species 
population, this is conducive to the preservation and continuation of the species. A 
variety of biological interactions constitute a complex nonlinear growth process of 
tumor cells, and the main influencing factors of tumor cell populations are the focus 
of research. Hunting cells refer to immune cells that participate in the removal of 
foreign objects and strengthen the immune response [26, 27]. 

In the NLHO algorithm, an N levels optimization structure is designed, which 
includes the leader level, guider level, executant level and follower level. In the entire 
population, the individual with the best search position is selected as the leader, who 
has the grasp of the entire search direction of the team it leads. The second level is the 
guider level, which executes the tasks issued by the leader and follows the direction 
of the leader to find the best. Of course, in the whole process, the guider will also 
refer to the task allocation of the global optimal leader to guide the executants to find 
the best, so as to prevent the leader of the team from falling into a local optimum. The 
third level is the executant level, which follows the guider to complete the task, in 
order to achieve a wider area of coverage search. At the same time, it will also refer to 
the tasks assigned by the leaders of the ethnic group to make the task goals clearer and 
speed up the convergence. The last level is the follower level. At this level, followers 
can be divided into any level to solve different optimization problems. Of course, 
in the later stage of searching, there may be excessive overlap between population 
individuals [28]. 


5.2 Ecological Evolutionary Dynamics Systems Model 


This part mainly introduces the mathematical growth model of tumor cells, which 
takes into account the influence of external factors such as chemotherapy drugs and 
immunotherapy on tumor cells, as well as the interaction between the two cells. In 
the following model, T(t) represents the number of tumor cells, /(t) represents the 
number of immune cells, Concne(t) and Conim (t) represent the blood concentration 
of chemotherapy drugs and immunotherapy drugs, respectively. 

Taking into account the interaction between immune cells and tumor cells, the 
direct killing of chemotherapeutic drugs and the growth model of tumor cells can be 
written as 


80 5 N-Level Hierarchy-Based Optimal Control to Develop Therapeutic Strategies ... 


T+) =T) +8 x T x (1-® x TH) T 

—yx T(t) x I(t)— ex T(t) X Conene(t) aD 
where, V; stands for inherent growth rate unrelated to immune cells and chemother- 
apy drugs, V2 stands for the maximum interaction ability between immune cells and 
tumor cells, ignoring chemotherapy drugs, ~y stands for the growth rate when tumor 
cells are inactivated and attacked by immune cells, £ stands for the stress response 
coefficient of tumor cells to chemotherapeutics. 

Considering the natural growth law of immune cells, we assume that a fixed 
number of immune cells are produced in a unit time, and these cells have an inevitable 
life cycle. Tumor cells in the body can stimulate the growth of immune cells, which 
is a positive non-linear change. In immunotherapy, the addition of immune drugs 
can produce an immune response, leading to non-linear growth of immune cells. At 
the same time, in the struggle between immune cells and tumor cells, the immune 
cells themselves will also cause losses. In chemotherapy, chemotherapy drugs can 
also cause damage to immune cells. 


It+1)=/@)+03—-Ax I(t) 
a, x T?(t) x I(t) a2 x T(t) x Conim (t) 
Bi + T(t) By + Conim(t) 
— 1 x T(t) x I(t) — E2 x Conce (t) x I(t) 


(5.2) 


where, v3 stands for rate of continuous inflow, À stands for natural decay rate without 
any external effects, a, stands for maximum recruitment rate caused by tumor cells, 
Qa stands for the largest proportion of tumor cells caused by immunotherapeutic 
drugs, (3) stands for steepness factor caused by tumor cells, 3; stands for steepness 
coefficient caused by immunotherapeutic drugs, é stands for stress response coef- 
ficient to chemotherapy drugs, éz stands for response rate of tumor cells to immune 
cells. 

At a certain point in time after the injection of chemotherapy drugs, the concen- 
tration of the drugs in the body will decrease exponentially. We are adding immune 
drugs at the same time. We can get the attenuation model of chemotherapy drugs and 
immune drugs in vivo. 


Concene(t + 1) = Xche (t) — e™” Conene(t) (5.3) 
Contin (t + 1) = xim(t) — e” Conin(t) (5.4) 


where, Xche(t) and Xim(t) represent the concentration of chemotherapeutic drugs 
and immune drugs, respectively. 0; and 62 are the attenuation rates of chemotherapy 
drugs and immune drugs. 

When we qualitatively analyze how to minimize the number of tumor cells remain- 
ing in the bloodstream under the premise of using as few drugs as possible, including 


5.3 N-Level Hierarchy Optimization Algorithm 81 


chemotherapy drugs and immune drugs, this process can be described by quantitative 
mathematical expressions. From formulas (5.1)—(5.4), we can get: 


t Xche (t) z = 
Frin = yoo foro +f tan`! (Ur 's)U; Rids 
tt (5.5) 


Xim (1) _ a 
+ 1 an=! (5's) D> Rods| 
0 


where, U; and U; respectively represent the maximum allowable dose of chemother- 
apy drugs and the dose of a single injection of immunizing agent, ô is the discount 
factor, w is a constant coefficient. 


5.3 N-Level Hierarchy Optimization Algorithm 


5.3.1 Leader Level of the Hierarchy 


First of all, as individuals with high fitness values, leaders have strong self-learning 
capabilities. Therefore, the iterative formula of design leaders is as follows: 


ie = x (1 + randn( u, a1) (5.6) 
where, i denotes the ith leader in the population, and j is the dimension. ¢ is the 
number of iterations. Randn is a Gaussian distribution, where the mean ju; = 0 and 
the standard deviation g; is shown below: 


1 HSR iell,2-N il 5.7 
exp(fi — fi) a ge a iF (5.7) 


where, fi is the fitness value of the /th leader at the tth iteration, and fiis the fitness 
value of any individual in the population that is different from the /th leader. 


5.3.2 Guider Level of the Hierarchy 


Secondly, as the individuals who guide the general direction of the evolution of the 
entire population for the leader, the guider must not only learn from the best overall, 
but also obey the leader’s command. 
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PPL a. ut 2 eee 
Xj = Xg,j + randn([Ug, 77) x (x) j Xq,;) 


(5.8) 
+5, x best, j — xj) 


where, g denotes the gth guider in the population, best is the best individual in the 
current iteration, sı is the acceptance factor of guider, ug = 0.5. 


og =exp( fi — fy) (5.9) 
= hae a Í 1 
Sı = exp (ie <= (5.10) 


where, £ is an infinitesimal value to prevent a guider from having a fitness value 
of 0. 


5.3.3 Executant Level of the Hierarchy 


The executants seek the best as the main body of the entire population. On the one 
hand, follow the guider’s arrangements, and on the other hand, follow the leader’s 
direction. 


t+1 t 2 t t 
ej = Xej t randn([e, 03) X (Xg, j — Xe, ;) 


S (5.11) 


X 


+s X i,j — xe j) 


where, e is each executant in the population, s2 is the acceptance factor of the executor, 


He = 0.8. 
otek 
c= exp (274) (5.12) 
s2 =exp(f/ — fi) (5.13) 


5.3.4 Follower Level of the Hierarchy 


Finally, there are followers, who themselves will be divided into multiple levels. 
Learn from each other at different levels, and notify the follow-up executant to check 
for deficiencies. 
1 2 
xe = xpi oe randn (Wf, . o%,) x Xj > xj) (5 14) 
+n x rand x (x, j — x’ j) 
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where, fn is the n-level follower, c, is the absorption factor of the n-level follower, 


Mp, = 0.8 — 0.6 x (t/tmax) Xi, j = Xe jfi = fe nis anatural number greater than 
0. 
on = expt, — Tp) (5.15) 
en = exp(fp . — fi) (5.16) 


5.4 Simulation and Analysis for NUHO 


In this experiment, the population size is set to 100, and the maximum number of 
iterations is set to be 100. Each algorithm is run independently for 50 times, and the 
spatial dimension is selected according to different test functions. The distribution 
rates of each level system are LPercent = 10%, GPercent = 20%, EPercent = 40%, 
and FPercent = 30%. The value of the updated algebra G = 10. 

For independent tests of 20 test functions, we separately count their mean, mini- 
mum and standard deviation to evaluate the performance of NLHO in various aspects 
by setting the difficulty in different aspects. At the same time, we select some typical 
algorithms for comparison, such as Taboo Search (TS), Chicken Swarm Optimization 
(CSO), Genetic Algorithm (GA), Ant Colony Optimization (ACO), and Simulated 
Annealing (SA), so as to compare the performance of the NLHO algorithm horizon- 
tally. The test results are shown in Table 5.1. 

For the independent tests of the benchmark functions, we calculated 5 parameter 
indexes respectively, which were their best, worst, median, average and std. deviation. 
At the same time, in order to verify whether the results are statistically significant, 
we use the Wilcoxon rank-sum test between NLHO and the other algorithms. “+”, 


Table 5.1 Experimental simulation results 


Function Optimization CSO GA ACO SA NLHO 
Ackley Mean 2.685462(-) | 8.88E-16(~) | 17.98669(-) | 8.157645(—) | 0.381913(-) | 8.88E-16 
Min 1.599236(-) | 8.88E-16(%) | 2.532499(-) | 0.818328(—) | 0.024938(-) | 8.88E-16 
Std. 3.327767(-) | 0(%) 5.021448(-) | 2.860999(-) | 0.248066(-) | 0 
Deviation 
Cross-in- Mean —2.04281(-) | —2.06247(-) | —1.52287(-) | —2.05567(—) | —2.05048(-) | —2.06261 
Tray 
Min —2.05463(-) | —2.06261(~)| —2.06257(-) | —2.06261(—) | —2.06257(-) | —2.06261 
Std. 0.050988(—) | 0.000129(-) | 0.314647(-) | 0.007304(—) | 0.008585(-) | 1.17E-10 
Deviation 
Drop-Wave | Mean —0.90552(-) | —1() —0.16478(—) | —0.9568(-) | —0.98441(-)| —1 
Min —0.93563(-) | —1 (~) —0.90648(—) | —1(~) —0.99988(-) | —1 
Std. 0.105019(—) | 1.63E-07(-) | 0.194473(-) | 0.025337(-) | 0.016453(-) | 0 
Deviation 
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“—» and “x” mean that the proposed NLHO is significantly better, significantly 
worse, and no significantly statistically different in the comparison, respectively. 

It can be seen from the experimental results that, compared with the other 5 
algorithms, under 20 test functions 60 indicators, NLHO wins 58, 29, 59, 53 and 
56 indicators, and they belong to different types of test functions, which reflects the 
better robustness of NLHO. Moreover, it can be seen from the Wilcoxon rank-sum 
test that NLHO is not only different from other population algorithms, but also has 
obvious advantages. The experimental results for each function are discussed in more 
detail below. 

The Ackley function fı is widely used to test optimization algorithms. It is a con- 
tinuous experimental function obtained by superimposing an exponential function 
with a moderately amplified cosine. It is characterized by an almost flat outer area, 
which is modulated by a cosine wave to form holes or peaks, making the curved sur- 
face undulating, but there is a large hole in its center. For the Ackley function, both 
NLHO and CSO have found the global minimum, and the average value is also equal 
to the global minimum, so that the standard deviation is also 0. The optimization 
process diagram of the NLHO algorithm is shown in this article, including the initial 
iteration diagram, the final result diagram and the intermediate process convergence 
curve, as shown in Fig.5.1. The two algorithms are comparable, SA performance is 
average, while TS, GA and ACO perform poorly. 

The Cross-in-Tray function fọ has multiple global minimums. On this function, 
the mean, minimum and standard deviation of NLHO have reached the best, and the 
experimental results are shown in Fig.5.2. At the same time, ACO also found the 
global optimum, but the mean and standard deviation are slightly inferior to NLHO. 
TS, CSO and SA performed well, while GA performed average. 

The Drop-Wave function f3 is multi-modal and very complicated. In each smaller 
input domain, its features have multiple ring-shaped peaks and valleys, and the depth 
of the valleys gradually decreases as the center of the circle shrinks. For the optimiza- 
tion process, it is very easy to fall into the local optimum. For this kind of complex 
optimization function, NLHO has shown strong optimization ability, and the global 
minimum, mean and standard deviation have all reached the best. The experimental 
results are shown in Fig.5.3. CSO, ACO and SA performed well, and TS and GA 
performed poorly. 


5.5 Develop Therapeutic Strategies for Ecological 
Evolutionary Dynamics Systems Using NLHO 


In this section, we apply the NLHO algorithm to the EEDS model as an experimen- 
tal verification. According to clinical treatment needs, chemotherapeutic drugs and 
immune drugs are used as input, and the cost of treatment loss is used as the objec- 
tive function. Through the iteration of the NLHO algorithm, the optimal therapeutic 
strategies for patients with a certain basic condition are worked out. For some of the 
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Fig. 5.2 Cross-in-tray 
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Fig. 5.3 Drop-wave 


Table 5.2 Experimental parameter 


Parameters Value Units Parameters Value Units 
v 0.00431 day~! 02 1.02 x 107° | cell! 
y 6.41 x 107! cell7! E 0.08 day~! 
x 0.204 day! & 3.42 x 1076 | cell! 
& 2x 1071 day! ay 0.0125 day~! 
a2 0.125 day~! A 2.02 x 10’ | cell? 

b2 2x10 cell i 0.1 day~! 
05 1 day~! 5 0.95 N/A 
w 0.1392 x 1074 |N/A 


According to clinical medical statistics borrowed from the literature [29], the 
specific parameters of the dynamic models are presented as Table 5.2. Based on the 
above, we have completed the establishment of the EEDS model, and determined the 
specific value of the cost function according to clinical needs. At the same time, the 
feasibility and effectiveness of the NLHO algorithm are also verified on benchmarks. 
Apply NLHO to the model of EEDS to develop therapeutic strategies. The best 
processing strategy is obtained through experiments, which proves the effectiveness 
and feasibility of the algorithm. The cost function is designed to minimize the number 
of tumor cells, and also to use the smallest dose of chemotherapeutic drugs and 
immune drugs to achieve the least harm to the human body. 
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Fig. 5.4 Quantity curve of tumor cells 


When we give the patient the initial number of tumor cells and immune cells, 
according to the EEDS model and follow certain chemotherapy and immunotherapy 
plans, we can get the following set of curves of tumor cells and immune cells. As 
shown in Figs.5.4 and 5.5, it shows the quantity curve of tumor cells and immune 
cells. Within a one-year treatment period, the number of tumor cells was successfully 
reduced to 254. Although the number of immune cells was reduced to 1.52x 106, 
significant effects were obtained for the treatment of tumors. Moreover, as shown in 
Figs. 5.7, 5.8 and 5.9, due to the decline of the body’s immune cells, the immune drug 
dropped to 0.022 and then increased to about 0.03. This caused the concentration of 
immune drugs in human blood to rise from the trough of 0.0096 to 0.03, reaching a 
sufficient level. 

The dosage of chemotherapeutic drugs is adaptively and dynamically changed, 
as shown in Fig. 5.6. Then the concentration of chemotherapeutic drugs in the blood 
also has the same changing trend, as shown in Fig. 5.8. The reason is to minimize the 
damage of drugs, and chemotherapy drugs are also harmful, not only killing tumor 
cells in the body but also destroying immune cells. If chemotherapeutic drugs are put 
in according to normal treatment methods, normal cells will suffer a lot of erosion, 
and the impact on the body will be even more significant. However, the drug dose 
optimized by NLHO will dynamically change adaptively, and the impact on normal 
cells will be appropriately reduced without affecting the elimination of tumor cells. 
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Fig. 5.6 Dosage of chemotherapeutic drugs 
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Fig. 5.8 Concentration of chemotherapy drugs 
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Fig. 5.9 Concentrations of immunereagents 


5.6 Conclusion 


It is a difficult problem to solve optimal therapeutic strategy for EEDS. Theoreti- 
cally, it can achieve the desired therapeutic effect through reducing the tumor cells 
through the combined of chemotherapeutic drugs and immune drugs, and minimize 
the harm to the human body. Benefiting from the concept of heuristic algorithm 
in evolutionary computing, this chapter has designed the NLHO algorithm via 20 
benchmark functions to test NLHO, including unimodal and multimodal, single- 
mode and multi-mode, single-extreme and multi-extreme, etc. It is compared with 
the five algorithms of TS, CSO, GA, ACO and SA, and runs independently 50 times 
to calculate the mean, minimum and standard deviation. It proves that NLHO has 
good optimization ability and can solve various problems well. At the same time, 
the development therapeutic strategies of EEDS have achieved very good results. 
The experimental results have shown that the NLHO algorithm develops therapeutic 
strategies well, and provides valuable prior knowledge and scientific basis for clini- 
cal medicine. Future work will further improve the EEDS, and integrate the optimal 
control strategy and the evolutionary calculation method for the optimal treatment 
method. 
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Chapter 6 A) 
Combination Therapy-Based Adaptive get 
Control for Organism Using Medicine 

Dosage Regulation Mechanism 


6.1 Introduction 


The death toll is soaring caused by neoplastic diseases, and the issues on nonlinear 
dynamics and control of tumour growth have motivated a widespread concern as [1]. 
Essential nutrients in humans are the resources for which the normal cells and tumor 
cells compete. Tumour cells will keep proliferating, robbing the limited energy supply 
of the body, and eventually disintegrating the somatic function to death. Somatic 
cells constantly divide, and new cells differentiate which end with apoptosis. In this 
manner, the relative balance can be maintained in human bodies. Nevertheless, when 
the process of differentiation for normal cells is out of control, the cells may well 
evolve into tumor cells. It is the nature for the tumor cells of which the tendency is 
to eat the body’s nutrients crazily. 

The population of tumour cells progressively increases for the following three 
characteristics. Firstly, the most obvious characteristic is the insensitivity to anti- 
growth signals. There exists strict control mechanism for normal cells, but for tumor 
cells, this mechanism is no longer valid. During the continuous process of division, 
tumor cells can escape from the monitoring of the anti-growth signals, which leads to 
the crazy growth of tumour cells. Secondly, tumour cells have the ability to promote 
the growth of blood vessels which are essential for providing nutrients, and it is the 
reason why the blood vessel density is associated with the malignant degree of the 
tumor tissue. Finally, tumor cells are also duplicitous, evolving camouflage abilities 
during their constant battle with immune cells to mislead the immune system into 
regarding them as normal cells, which results in the tumor immune escape. Thus, to 
suppress the growth of tumour cells, obstructing the generative mechanisms which 
relies on the necessary nutrients was an effective approach as literatures [2, 3]. 

Distinguishing from the mixed tumor treatment approach of immunotherapy and 
chemotherapy as [4], this chapter explores a more effective adaptive control strategy 
for organism using medicine dosage regulation mechanism. An additional popula- 
tion of cells which called endothelial cells enjoy the substances induced by malignant 
tumour cells, and they could transfer oxygen and nutrients to the primary focus caus- 
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ing proliferating of blood cells, which will increase carrying capacity of tumour cells 
known as tumour angiogenesis in [5]. As indicated in literature [6], anti-angiogenic 
agent could particularly decrease the growing rate of tumours, reaching saturation to 
some extent without killing the endothelial cells completely. When the chemotherapy 
agent was used in combination with anti-angiogenic agent to reduce the population 
of tumor cells , the latter could increase the effect of the former as described in [7]. 
Nevertheless, as the key element of promoting the growth of the vasculature, the 
endothelial cells could not be completely destroyed. Otherwise, it may not exist that 
the specified number of vasculature for constructing access of chemotherapy agent. 
On the basis of the pharmaceutical science concerning the chemotherapy agent and 
anti-angiogenic agent, the adaptive control strategy for organism will provide a guid- 
ance for clinical practice under the medicine dosage regulation mechanism, espe- 
cially in the treatment process of Lung cancer. Furthermore, what counts is that since 
the anti-tumor drugs often kill both tumor cells and normal cells, it’s of significance 
to utilize less drugs to achieve the therapeutic goal during the treatment process. 

ADP is derived from dynamic programming and reinforcement learning, is a 
powerful tool to tackle optimization issues [8—10]. In general, the successful imple- 
mentation of ADP-based methods depends on the cooperative work of actor and critic 
networks [11]. Under this framework, the actor is responsible for performing the con- 
trol strategy with current data [12]. The goal of critic is to provide actor with the 
feedback information derived from the evaluation of the cost under the strategy. The 
distinct merit of this type of algorithm lies in that the optimal control strategy could 
be approximately acquired in the manner of iteration computation, and the “curse of 
dimensionality” could be obviated with effect. Different ADP-based methods have 
been researched by scholars to tackle multifarious optimal control problems with the 
aid of the artificial neural networks of which the performance is outstanding [13, 14], 
such as the robust control [15, 16], optimal consensus control [17—19] and the opti- 
mal tracking issues [20, 21]. Furthermore, for the system with multiple controllers, 
the optimization issues can be formulated by game theory. As a vital branch of game 
theory, NZSGs is derived from [22] with the goal of attaining the optimal strategy pair 
that can minimize the personal performance index for each player when stabilizing 
the controlled system [23-25]. Due to the excellent ability to approximate optimiza- 
tion, the ADP methods have been proposed to solve NZSGs. In [26], the adaptive 
method of critic-only structure was developed to solve two-player NZSGs without 
any initial stabilizing control. The experience replay technique was integrated into 
the ADP algorithm in [27] to concurrently utilize the historical data together with the 
real-time data to approximate the value function such that the persistence of excita- 
tion condition was not indispensable. In [28], the data-based integral reinforcement 
learning algorithm was proposed to solve NZSGs. More specially, it was a novel 
iterative learning algorithm based on both off-line and online manner which could 
extend the applicability of the data-based control scheme. Furthermore, in [29], the 
discrete-time N-player NZSGs was tackled via the off-policy reinforcement learning 
method which was independent of system dynamics. 

Although the relevant academic achievements have been presented in theories and 
applications as [30-38], there is seldom any literature on this filed according to litera- 
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ture survey of the authors. The contributions can be shown as follows. First, the near- 
optimal therapy for the treatment of tumor is firstly acquired via the ADP approach 
which is an efficient adaptive intelligent learning algorithm. Second, the interactive 
system with discounted value function is constructed based on the mathematical 
model simulating the interaction relationships among cells and drugs. Besides, two 
kinds of chemotherapy drugs and a kind of anti-angiogenic agent participate in the 
therapy such that the combination therapeutic strategy can be derived under the archi- 
tecture of NZSGs. Third, the idea of cybernetics is extended to the frontier fields of 
medicine, more precisely, the therapy of tumor. Under the MDRM, the derived ther- 
apeutic strategy can achieve the therapeutic goal with the lowest doses of drugs, and 
the practical indications for medicine is considered for the first time. 

Notations: Nt denotes the set containing all positive integers. || - ||, diag{-} and 
v(-) £ O(-)/Ox respectively represent the Euclidean norm of a vector/matrix, the 
operation of constructing diagonal matrix and the gradient operator. Am (+) and Ay (-) 
separately denote the minimum eigenvalue and maximum eigenvalue of a matrix. 
I„xn is the unit matrix whose dimension is n. 


6.2 Preliminaries 


6.2.1 Establishment of Mathematical Model 


In this section, the growth mathematical model is established which considers the 
interaction relationships among the normal cells, tumor cells and endothelial cells. 
Moreover, the effects of control inputs, i.e., the chemotherapy and anti-angiogenic 
drugs, on these cells are embodied in the model. Thus, in the model formed from 
ordinary differential equations as follows, Pyc(t), Prc(t) and Pgc(t) respec- 
tively represent the populations of normal cells, tumor cells and endothelial cells, 
Pep, (t)(y = 1, 2) and Pap(t) denote the concentrations of chemotherapy and anti- 
angiogenic drugs. 

The population of normal cells, which is influenced by tumor cells, endothelial 
cells and the concentrations of chemotherapy and anti-angiogenic drugs, is modeled 
by 

; Pyc(t) 
Pyc(t) =a Pye (1 = a) — AıPyc(t)Prc(t) 
Pyc(t)Pcpi (t) 
Bı + Pyc(t) 
Puc (t)Pcn2(t) 
Bı + Pyc(t) ’ 


— E1 (Pec (t), Pan (t) 
— &)(Pec(t), Pan(t)) (6.1) 


where £, (Pec (t), Pap(t)) = 2,1 Pec (t) + Si2Pap(t) + 2,0,1 = 1, 2. The param- 
eters a,, By, Cı denote the proliferation rate, Holling type 2 constant and carrying 
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capacity for normal cells, respectively. A; is the contention parameter between nor- 
mal cells and tumor cells. 

As the tumor cells contend with normal cells for necessary nutrients, the popu- 
lation of tumor cells is affected by that of normal cells. Besides, there exist mutual 
effects among tumor cells, endothelial cells and the drugs. Thus the corresponding 
model can be written as 


: _ a2 Prc(t)Prc(t) Prc(t)Pcpilt) 
Prc(t) = a2Prc(t) C4 © Prot) TI (Pec(t), Pav) Prot) 
Prc(t)P, 
— Ih (Prc (t), Pant) Toet) — A2Pyc(t)Prc(t), (6.2) 


where TI (Pec(t), Pan(t)) = Iı Pec(t) + IT,2 Pap(t) + Io, J= 1,2. The 
parameters a2, B2, C2 are multiplication rate, Holling type 2 constant and carry- 
ing capacity for tumor cells. Az is contention parameter between normal cells and 
tumor cells. 

The population of endothelial cells is associated with tumor cells and anti- 
angiogenic drugs. The relations can be given as 


Pec) &3Prc(t)Pap(t) 


Pec(t) = s+ K Pre()+03Pect)(1- =E eae 
2 EC 


(6.3) 


where K is multiplication rate caused by tumor cells and sı the inflow rate. Similarly, 
the parameters a3, B3, C3 are multiplication rate, Holling type 2 constant and carrying 
capacity for endothelial cells. Æ; is the killing rate for endothelial cells. 

The concentrations of the drugs decrease during the treatment phases, owing to 
the washout process. Hence we can model the evolution process of the concentrations 
of chemotherapy and anti-angiogenic drugs by 


Pyc(t) PA Prc(t) 
1 
Bı + Pyc(t) B+ Prc(t) 


Pepi(t) = Dra — (Ba +m )Pent) (64) 


Pyc(t) m Prc(t) 


Pep2(t) = Dra-(8 2+m3 4 
B,+Pwnc(t) By + Prc(t) 


)Pen(t) 6.5) 


and 


msPec(t) 
B3 + Pec(t) 


Pan) = Dra — (Ga + ) Pav(o), (6.6) 


where Dr,,, Dre2 and Dr, are the control inputs. @,), G-2 and Ba denote the washout 
rates for the drugs. mı, m2, m3, m4 and ms are the rates at which the drugs integrate 
into the cells. Based on the operations similar to that in [39], we obtain the simplified 
version of the model as 
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Pnc(t) =a1 pwc — pwc) — apne) prc(t) 
7 punct) peni (t) : Pnc(t)Pcpn2(t) 
bi + pyc(t) bı + pwc(t) 
Pre t) 
1+ opec(t) 
Prce(t)pcni(t) Prce(t)pcpn2(t) 
bo + prc(t) bz + prc(t) 


Pre() =oapre((1 ) - propre) 


Pec(t)pap(t) (6.7) 


Pec(t) =s1 + kprc(t) + a3pec(t)C — pec(t)) — & 


` bs + pec(t) | 
peoi (Ð Suei ~ (Ba +m- pac ae oe ay) Peni, 
Pcp2(t) =u — (Ba +m3 D, PA T prc) penn. 
Pavitt) = ta ~ (Pa + PERO) pani), 


where &,(pec(t), pan(t)) = &1pectt) + &2pan(t) + &o and 7, (pec(t), pav()) 
= Tj pecht) + 72pap(t)+7,0 with 1,7 = 1,2. The states pyc(t), prc(t), 
Pec(t), Pcpi(t), Pcp2 and pap are nonnegative. 


Remark 6.1 The differential equation (6.7) is the simplified model describing the 
interaction relationships among cells and drug. Observing the model, one can dis- 
cover that there exists competition between normal cells and tumor cells. The tumor 
cells require more nutrients such that they facilitate the proliferation of endothe- 
lial cells, which could provide the indispensable nutrients to promote the growth 
of tumor. The tumor cells can be effectively damaged by the chemotherapy drugs 
which have side-effects on normal cells to some extent, and the anti-angiogenic drug 
contributes to the proliferation inhibition of the endothelial cells. 


6.2.2 Nonzero-Sum Games Formulation 


Consider the interaction model (6.7) rewritten as 


03x1 03x1 03x1 03x1 
; 0.06 0 0 O 
x= f(x) + 0 0 uy + 01 0 u2 (6.8) 
0 0.12 0 0 


= f(x) + gia + guz, 


where u1 = [uc1, Ua]? , u2 = [ue2, OJ" and f (x) is constructed by the right-hand side 
parts of (6.7) excluding the terms u,1, Uc2 and ua. 
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Define the value function for player 7(2 = 1, 2) as 
[o0] 
Vi(x(t)) = f eS (x, u1, wa)ds, (6.9) 
t 


where the utility function ô, (x, u1, u2) = x Nix +uT Raui +u? Ruz. The 
matrixes R,, (1, J = 1,2) and Y, are positive definite, and ọ, > 0 is the discount 
factor. According to the value function (6.9), we can define the corresponding Hamil- 
tonian function as 


H, (x, u1, u2) = (VV,)" (f + giui + gota) 
+ô, (x, u1, U2) — Q: V,,1 = 1,2. (6.10) 


The optimal value function is defined as 


ss N=2 
y*ř = min f e BO QTN x + 5 u, Riu, )ds. (6.11) 


a 
1 t = 


The target of NZSGs is to attain the admissible strategy pair {u{, už} with the 
definition given in [23, 40]. According to the stationarity condition, the optimal 
strategy for player 1 could be obtained by 


1 
u* = -5 n VV". (6.12) 
Thus the HJEs can be obtained as 


H, (x, uï, už, VV“) = (VVT + gut + 9203) 
+T x + UW Rau] +u Ruš — oV* =0. (6.13) 


Remark 6.2 It’s noteworthy that there exists no zero equilibrium for system (6.8), 
which may well result in the divergence of V,(x(t)). To resolve this issue, the dis- 
counted factor g, is introduced to form decay term such that V, (x(t)) can be conver- 
gent. 


In general, solving NZSGs is synonymous with solving the equations (6.13). Nev- 
ertheless, for nonlinear system, it’s very intractable to tackle the coupled equations. 
To resolve this difficulty, an ADP method utilizing dosage regulation mechanism is 
proposed in the following sections. 
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6.3 MDRM-Based Adaptive Critic Learning Method 
for NZSGs 


Firstly, we introduce the indications for medicine to judge when the medicine dosage 
should be regulated. Then under the MDRM, the ADP method of single-critic archi- 
tecture is proposed to approximately seek the optimal strategy for the NZSGs of 
model (6.7). 


6.3.1 MDRM-Based Optimal Strategy Derivation 


For the sake of realizing conditioned therapy strategy, MDRM is required to handle 
the clinical data such that the strategy can be changed timely and necessarily. The 
time sequence {ñe} is constructed for recording the regulating instants and £ denotes 
the £th regulating instant. Then the state could be denoted as 


Xe(t) = x (fe), t € [he, Mepa). (6.14) 


For evaluating the difference between real-time data and latest recorded data, 
it’s necessary to define an error function that zp = Xe — x(t), t € [Re, hey). The 
operation of MDRM depends on the regulating condition which compares the error 
Ze with the threshold associated with real-time data. The strategy is adjusted only 
when zg is larger than the threshold. That is, ù, = w,(X¢),1 = 1, 2, and £ € N+. Thus 
the MDRM-based strategy could be got as 


1 . 
ay = -Rn eV a =1,2, (6.15) 


where VV* = ðV* /ðx whent = he. The version that based on the adjustment mech- 
anism of HJEs is derived as 


N=2 
H, (x, tf, 0, VO = 7 D VVV DORG RIR 9) OVV? 
j=l 


N=2 
1 eee 
+ (VV*)7 |f- 5 y URTI, K)VV) | +x? Gx — oV”. (6.16) 


J=! 


Differing from HJEs (6.13), due to the existence of the error zz, (6.16) does 
not equal to zero. Before proceed with the discussion, the following assumption is 
required [41]. 


Assumption 6.1 The optimal strategy u* is locally Lipschitz. That is, for: = 1, 2, 
there exists a constant 0, > 0 such that ||w* — a*||? < 0, |x — žel’. 
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Theorem 6.3 Consider the system (6.8), and suppose that Assumption 6.1 holds 
and V* is the solution of (6.13). Then u* formulated as (6.15) can stabilize system 


(6.8) when the following medicine indication is applied 
(1 = 26)Am (1) 


2 
Du) Ixl, (6.17) 


lzel? < 
where À € (0, 1/2) is adjustable parameter. The terms 0, Y and Y are given in (6.21) 
and (6.22). 


Proof Selecting the Lyapunov function Lya = VĚ + Vž, we can obtain the corre- 
sponding derivative as 


N=2 
Lya = X (VVT (F + gŭ} + p3). (6.18) 


1=1 


According to (6.13), we have 


(VV*)" f =— (VVA (gut + puž) — x7 Nx 


— uj Raul — u3 Rou} + o V*, (6.19) 
and 
N=2 N=2 
(VV)? >> 9, (u* — tt) = —2u** Rag D> g (us — ii*). (6.20) 
J=! j=l 


Let u* = [u*?, u37]" and a* = [(ŭùŭ* — ut)", (ŭž — už)" ]". Then we can derive 
that 


N=2 N=2 
Lya =- x Nx — x? Nx — > X wT Ryu 

=l. j=1 

N=2 N=2 

xT —1 EE * 
+2 D ur Rug, AG = w) +o VÝ + oV 
1=1 j=l 
=— xT Yx — a Ru* — 2u" Zù* 


+ oi VŽ + avs, (6.21) 


where Y=N +N, R= diag{Ri + Ro, Ria + Rr}, and Z = [Z,, Z2] with 
LZ = [Rigi I. Rag gl”, 1 = 1, 2. Applying Young’s inequality, we have 
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Lya <—x' Tx — uT Ru* +u T Ru* 
ae ZTR! Zù* + py 
=— x" Yx + ùT YÙ" + ov, (6.22) 
where Y = ZT R7! Z. It’s noted that u* is the admissible strategy, we can derive that 
V* is bounded. Hence oy is the bound of the term 0) V¥ + o2 Vž. According to the 


definitions of Y and Y, we have that A„ (1) > 0 and Ay(Y) > 0. Furthermore, we 
can obtain 


Lya < = An (lx? —  — 26) Am (YI? 
+ Am(Y)Ollzell? + ov, (6.23) 
where 0 = 0; + 02. When the indication (6.17) is satisfied, we derive that Lya < 


—2¢Am(Y)||x|/? + ov. Then we can find that Lya < 0 holds when ||x|| > | OT 


In light of Lyapunov theorem, the strategy (6.15) can stabilize system (6.8). This 
completes the proof. E 


6.3.2 Implementation of Adaptive Critic Learning Method 


In this section, the approximate optimal strategy under MDRM is derived by ADP 
method of single-critic architecture. In light of the universal approximation properties 
of neural networks (NNs), V* can be obtained by 


V = wf, (x) +0,,1 = 1,2, (6.24) 
where w¥* is the ideal weight vector, v, the activation function and g, the approximate 
error. To acquire the approximate version of the unknown vector w*, the critic NN 
is constructed by 


V, =O" y,(x),1 = 1,2, (6.25) 


with ô being the approximate vector. With the aid of critic NN, we can present the 
optimal strategy as 


1 
a= -Ra (WH)! er + Vo,),1 = 1,2. (6.26) 


L 


Accordingly, we can obtain the optimal and approximate optimal strategies under 
MDRM as 


ŭ* = — 5; of (Xe) (V4 (Xe) w* + Vor (Xe), (6.27) 
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and 


1 
i, = -5 n g7 (Xe) (Vy, (ŠAT &. (6.28) 


Lll 
Then the approximate Hamiltonian can be presented as 
H, (x, Ù, ito, @,) = OT ay, F 5, (x, ù, it) = E, (6.29) 


where Y, = Vy, (f + giui (Xe) + gnit2(Xe)) — 0,1. 
In order to minimize e, in (6.29), we set the target of minimization as E = E; + 
E, = 1/ 2e? +1/ Je: Via applying gradient descent approach, we obtain 


T 1 Ta Y — 
(WT, +1)? 00, “(Ty +1)? * 


WDE (6.30) 


where y, is the adjustable parameter and w, = v,/(T dy, + 1}. Define ©, = w* — 
@,. From (6.30), we derive that 


Č, = -pph pT O, + yie, (6.31) 


where Y, = Y, / (Ty, + 1) and the approximated residual error e, = — Vof (f + 
git, + gott2) + 0,0, . For proceeding further, the following assumptions are required 
[11, 26, 27]. 


Assumption 6.2 For any: € {1, 2}, the signal 1), is persistently excited on the time 
interval [ż, t + T]. That is, there exists the positive constant by, such that 


t+T T 
Din INaxNa = f ppi ds, (6.32) 
ti 


with Na, being the neuron number of the zth critic network. 


Assumption 6.3 For: € {1, 2}, there exist positive constants such that |w* || < bu, 
V4 || < bn, IVl] < bo: and |e, || < ba. 


6.4 Stability Analysis 


In this section, the asymptotic stability of the controlled system is analyzed by apply- 
ing Lyapunov theory. Before presenting the main results, the boundedness of critic 
weight is discussed in the following lemma. 


Lemma 6.4 For any ı € {1,2}, suppose that Assumptions 6.2-6.3 hold and the 
initial weight is finite. If the critic tuning law (6.30) is applied, then it holds that 
w, is locally ultimately bounded. 
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Proof Consider the Lyapunov function as Lyw. It’s noted that the derivative of @, is 
flow dynamics, which indicates that there doesn’t exist any jumps in the values of 
w,. More specially, ©, is continuous at the regulating instant. Thus we only need to 
consider the time interval between two adjoining regulating instants. 


According to Assumptions 6.2—6.3, it can be derived that 
i = 2a oy + 20T dy 
= 29 (Gi did] & + õie) 
+ 2y (üp Č + Čez). (6.33) 


By applying Young’s inequation, we can get 


Lyu < — yı (Čip pi & — ef er) 


— y (Č p &2 — ef ez) 


< = qiblai? — y2by2ll@2ll? + T, (6.34) 
where NM = yıb? + 2b?,. Furthermore, when ||@ || > ae £ ba or lœ > 
a 5 £ bz,, it yields that L yw < 0. The lemma is proved. a 


Theorem 6.4 Consider the system (6.8) with strategy formulated as (6.28). Suppose 
that Assumptions 6.1—6.3 hold. The tuning law for critic network is given by (6.30). 
Then the state x and weight estimation error œ, are UUB provided that the indication 
is applied 


(1 — w7)Am(Y) 
lizel? < ato Pury = lIzell*, (6.35) 


with w and w2 being the adjustable parameters. 


Proof Select the Lyapunov function candidate as 


N=2 N=2 {N= 
— ky% * ~T ~ 
Ly = 3 Vi (Xe) + 3 Vœ) + 5 3 wy w 
=Lya + Lyp + Lye. (6.36) 


Due to the utilization of MDRM, we present the proof process in two cases. 
Case I: No regulation occurs, i.e., t € [Ae, ey). Then we obtain Ly, = 0. The 
derivative of Lyp can be obtained as 


N=2 
Lys = X VVT + gŭ + pù). (6.37) 


1=1 
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Let ù = [(ŭı — ut)", (@2 — u3)"]’. Applying the operations similar to that in 
Theorem 6.3, we have 
Lyp <—x'Yx4+aTVat+ Ov 

<= x Tx + oy +Au(Y)|le* — ùt + at — ùll? 
+ Au (Y) lus — w3 + ù — tall? 

<= x" Yx + ov + Am(Y)(1 + 1/m2)llet = čal? 
+ AMY) + wa) llat — aI)? 
+ Au YA + 1/2) Ile — aI? 
+ AMY) + w2)lley — 317. (6.38) 


Recall that 9 = 0; + @2, and substitute (6.27) and (6.28) into (6.38). Then we can 
derive 


Lyp < —x™ Vx + (1+ a) OAM(Y) |x — žel? 
+oyv+ h, (6.39) 


where D = tAy(Y)(1 + 1/0) (IRT lb? b2, bZ, + IR 1b2b2,b2,) + rs 
Amu (Y) + w) (IR II7B2,b2, + IR I7b2b22) with b,, and b,. denoting the 
bounds of known g; and gp. 

According to Assumption 6.2 and Assumption 6.3, we derive that 


Lye < —Vbyillill? — ybyell@2I? + Ti. (6.40) 


Based on (6.39) and (6.40), we can obtain 


Ly < — (1 — wm (VY) II? — An (1) (17 
+ (1+ w2)Am(Y)OI|x — Fell? — byla? 
= byll? + £, (6.41) 


where £ = I) + I + oy. Applying the indication (6.35), then we conclude that 
Ly < 0 when one of the conditions hold that 


1 E i 

Pea: x eee = Aa 6.42 

ecaa (6.42) 
£ 

II, || > a = fpr =l: (6.43) 
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Thus x and w, can be guaranteed to be UUB. 
Case IT: A regulation occurs, that is, t = he). The difference of Ly can be given 


by 


ALy = ALya + ALy, + ALy¢, (6.44) 


where the terms are defined by ALyg = Vj‘ (Xe41) — Vii (Xe) + VŽ (Xe41) — VŽ (Xe), 
ALys = Vita Pe) — VERa) + VE es) Vai) ALye = 
1/207 (Re+1)@1 (ep1) — 1/207 (Az) (ig) + 1/207 (epi) ey) — 1/207 
(Ap, )02 (hp, ;)- Recalling the analysis in Case I, we obtain that Ly < 0 when x or 
&, is out of the corresponding bound. Furthermore, we can derive that Ly, + Lyc is 
monotonically decreasing when ¢ € [/ig, Ae+1). In light of the properties of limits, 
we have 


0 <V Og, + ZAT Cia OMe.) 
— V,*(x(Req1)) — TaT erd (e1). (6.45) 
As x is proved to be UUB, we can obtain 
Vi" e1) < V* o). (6.46) 


According to (6.45) and (6.46), we can derive ALy < 0, which indicates that the 
selected Lyapunov (6.36) is monotonically decreasing when t = e41. This com- 
pletes the proof. E 


Remark 6.5 w; and w2 in (6.35) are the adjustable parameters which determine 
the frequency of medicine dosage regulation. A larger w or w2 leads to a higher 
regulation frequency, and a smaller parameter implies a lower adjustment frequency. 
Thus we can determine these parameters according to the clinical data. 


Remark 6.6 In thischapter, the approximate optimal combination therapeutic strat- 
egy is derived via ADP method to inhibit the proliferation of tumor cells under the 
mechanism of medicine dosage regulation. The MDRM is constructed on the foun- 
dation of the above-mentioned medicine indication (6.35). The data at the dosage- 
regulating instants should be recorded and will be utilized as reference data in the 
future. When the difference between the current clinical data and latest reference 
data is larger than the threshold, the medicine dosage can be regulated. Therefore, 
this mechanism can guarantee the derived therapeutic strategy to be regulated timely 
and necessarily. 
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Table 6.1 Parameter specifications of the cells 


Parameters | Operations Values Parameters | Operations Values 

a - 0.0068 day~! a2 = 0.01 day! 

a3 = 0.002 day~! a AC) 0.00702 day—! 
a A2C2 0.00072 day~! by By/C, 1.10 

bz Bo/C2 4.6205 b3 B3/C3 4.6666 

(o) ®C3/C2 0.1615 s1 = 0.001 

k KC2/C3 0.0037 1day~! 


Table 6.2 Parameter specifications of the drugs 


Parameters | Operations | Values Parameters | Operations | Values 

£10 Z10/C1 1.2 x 1077 day~! ĉi Z11C3/C1 | 4.2 x 1078 day! 
E12 212/C, 1.0 x 1077 day—! T10 Tho/C2 0.2051 day—! 

T11 TI1C3/C2 | 0.00431 day—! T12 T2/C2 19.4872 day! 
£20 Z2/C1 6.0 x 1078 day~! £1 271C3/C, | 2.2 x 107° day! 
£22 E09/C} 1.0 x 1078 day—! T20 Tho/C2 0.1251 day—! 

T21 TIy1C3/Cz | 0.00217 day! T2 T/C 15.7819 day—! 

& 23/03 1.7143 day—! mı - 0.0002 day~! 

m2 - 0.032 day—! m3 - 0.0004 day~! 

m4 - 0.028 day! ms - 0.032 day! 

Bet - 0.01813 day—! Ber = 0.01529 day™ ! 

Ba z 0.136 day—! 


6.5 Simulation Study 


In this section, the mathematical model (6.7) is considered which presents the rela- 
tions between cells and drugs. For simplicity, we have constructed the rephrased 
system (6.8) of which the control issue could be deemed as NZSGs. 

In light of the clinical medical statistics and literature [38], the parameters on cells 
and drugs for model (6.7) are given in Table 6.1 and Table 6.2, respectively. For the 
discounted value function (6.9) of system (6.8), the corresponding parameters are 
set as Ry; = 0.8hy2, Rio = 15h y2, Roy = Shy2, Rv = hy2, Yi = 0.0216x6 and 
N = 0.0676x6. In addition, the discounted factors 0; = 02 = 0.2. 

For the critic NNs, the activation functions are both set as [x?, X1X2, X1X3, X1X4, 
X1X5, X1X6, X; X2X3, X2X4, X2X5, X2X6, i X3X4, X3 X5, X3 X6, a X4X5, X4X6, a, X5X65 
tel and the learning laws are set by yı = 1.5 and y2 = 2. Besides, the parameters 
0 = 8, wı = 0.8 and wm = 8. 

The evolution curves of the model (6.7) are depicted in Fig.6.1. From Fig. 6.1 
we can observe that when t = 200d, the population of tumor cells reduces to zero, 
and when t = 600d, the population of normal cells almost returns to 1 and that of 
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Fig. 6.1 The evolutions of model states 
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Fig. 6.2 The therapy strategy curves of chemotherapy drug 1 


endothelial cells drops down to a small steady value. This indicates that the prolif- 
eration of tumor cells can be suppressed after 600 days under the optimal therapy 
strategy. In Figs. 6.1, 6.2, 6.3, 6.4, 6.5, 6.6 and 6.7, we compare the medicine dosages 
of the derived therapy strategy and that of initial therapy strategy. It indicates that 
the medicine dosages of our near-optimal therapy strategy are significantly less than 
the dosages of initial strategy. It’s of great practical significance since superfluous 
drugs may well affect the health of patients and impose additional financial burdens 
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Fig. 6.3 The therapy strategy curves of chemotherapy drug 2 
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Fig. 6.4 The therapy strategy curves of anti-angiogenic drug 


on patients. Besides, one can find that when the clinical data becomes better, the 
regulation frequency of the derived therapy strategy becomes lower. This implies 
that the therapy strategy based on medicine dosage regulation mechanism can be 
regulated with the indications for medicine timely and necessarily. Figures 6.5, 6.6 
and 6.7 present the curves of the cells under different therapy strategies, that is, 
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Fig. 6.5 The population of normal cells under different therapies 
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Fig. 6.6 The population of tumor cells under different therapies 
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Endothelial cells 
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Fig. 6.7 The population of endothelial cells under different therapies 


chemotherapy drug 1, chemotherapy drug 2, anti-angiogenic drug and the therapy 
comprised of these three drugs. We can conclude from Figs. 6.5, 6.6 and 6.7 that the 
therapeutic effect of the derived therapy is the best. Thus simulation results validate 
the effectiveness of our therapy strategy 


6.6 Conclusion 


In this chapter, an ADP-based method using medicine dosage regulation mechanism 
has been proposed to obtain the optimal combination therapy for curing cancer. 
A mathematical model is employed to describe the interactions among the normal 
cells, tumor cells, endothelial cells, chemotherapy drugs and anti-angiogenic drug. 
The mathematical model provides the foundation for us to solve the optimization 
issue under the architecture of NZSGs. The ADP method of single-critic framework 
is proposed to approximately seek the optimal strategy. In addition, the introduction 
of the medicine dosage adjustment mechanism guarantees the therapy strategy to 
be adjusted timely and necessary. Finally, the theory analysis and simulation results 
both indicate that the designed strategy can effectively decrease the population of 
tumor cells and endothelial cells with very few medicine dosage, which verifies 
the availability of the proposed method. Our future research direction is to seek 
the optimal strategy for decreasing tumor cells or other harmful cells with latest 
therapies, for example, the therapy applying oncolytic virus. 
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Chapter 7 A) 
Adaptive Virotherapy Strategy get 
for Organism with Constrained Input 

Using Medicine Dosage Regulation 

Mechanism 


7.1 Introduction 


Low efficacy and high toxicity for patients is the characteristics of traditional ther- 
apies as surgery, chemotherapy, and radiation, hence the most prosperous tumor 
treatment strategy, oncolytic virotherapy which depends on the virus with relatively 
weak pathogenicity and appropriate gene modification, simultaneously, the thera- 
peutic effect benefits from strong replication capabilities. Similar to the principle 
of targeted therapy, gene-modified viruses repressed selectively infect tumor cells 
(ITCs) through rapid replication increment, and ultimately destroy TCs, concur- 
rently, activate the body’s immune response. Soluble tumor virus therapy not only 
can kill TCs, but also attract more immune cells to kill residual cancer cells, how- 
ever, it doesn’t deplete normal cells in the body. Oncolytic virus (OVs) enjoyed the 
superiority of minimal side effects and optimal therapeutic effects compared with 
traditional treatment strategies as literature [1]. Development of oncolytic viruses 
benefit from the virus-specific lytic CTL response eliciting immunostimulatory sig- 
nals and contributing to killing of ITCs as literature [2], thus, viral doses, number of 
doses and timing with reliable mathematical models are the future research direction. 

To lucubrate cancer virotherapy, mathematical models which described mecha- 
nisms of TCs, OVs and immune cells have been proposed and updated as literatures 
[3, 4]. Literature [5] expounded the inner mechanism including uninfected tumor 
cells (UTCs), ITCs and free viruses. Successively, the infected cells and uninfected 
cells are distinguished through logistic growth of TCs and elimination of free recom- 
binant measles viruses as [6]. What matters most is the immune response which leads 
to inhibitory effect of viral therapy for misregarding of genetically modified viruses. 

Therapy efficiency depends on hyperimmunity or not, in other words, infected can- 
cer cells and viruses are swallowed for indistinguishability. Literatures has demon- 
strated the side effect of immune cells, and immunosuppressive agent cyclophos- 
phamide is chosen to reduce immune response [7]. Reference [8] has considered 
the virus-free population adding the previous three variables, reflecting interactive 
relationship between innate immune with infected cancer cells and the virus cells, 
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evolving into an effective mechanism analysis model, but more effective control strat- 
egy is in urgent need. Cytokines form natural killer cells contribute most strength on 
destruction of both tumor and virus-infected cells. The proposed model gives explic- 
itation of interplay among TCs, OVs, and immune response, which is the guideline 
of optimal therapeutic strategies or dosage regimen on oncotherapy. Although cor- 
relational research on regulation on immune system and TCs has been proposed 
using ADP as [9], selective oncolysis will enjoy optimal therapeutic effect through 
gene-modified viruses compared with wild-type OVs based on ADP method. 

As a vital branch in machine learning, obtaining information from interactive 
environment [10-12], reinforcement learning (RL) has been demonstrated to perform 
well in solving optimal control issues of nonlinear systems [13]. The ADP method, 
which was derived from RL and dynamic programming, generally attempts to obtain 
the optimal strategies with the aid of the classic critic-actor algorithm framework 
[14]. Under this architecture, the critic evaluates the cost when the current strategy is 
applied, and actor updates the control strategy in accordance with the feedback infor- 
mation provided by the critic. Thus the approximate optimal strategy can be derived 
and the “curse of dimensionality” can be obviated. Recently, ADP-based methods 
have been widely researched to tackle various optimal issues, for instances, tracking 
control [15—17], optimal consensus control [18—20], zero-sum games and nonzero- 
sum games [21-23]. Different from fuzzy approximation as [24], the robust dynamic 
NN was established to asymptotically identify the uncertain system with additive dis- 
turbances, and the critic and actor worked together to find the equilibrium solution 
for nonzero-sum games subject to nonlinear system. The identifier was developed to 
reconstruct the unknown dynamics and the critic was tuned by a concurrent learning 
strategy which could effectively use real-time data and recorded data such that the 
persistence of excitation (PE) condition could be removed. By utilizing both online 
and off-line data, a data-based policy gradient ADP method was developed to seek 
optimal scheme in [25]. To address global optimum control issue and avoid falling 
into local optimality as[26], the ADP method which combined with the predesigned 
extra compensators was proposed in [27]. The introductions of these compensators 
contributed to deriving the augmented neighborhood error systems, thus the system 
dynamics requirement for ADP was avoided. In [28], integrating the neural network 
learning ability and the spirits of ADP, a general architecture of intelligent critic 
control was proposed to solve the robustness issues of disturbed nonlinear systems. 

As saturation phenomena which exist widely in many practical systems can affect 
the system performance, multifarious ADP-based method were proposed to achieve 
optimal control with input constraints [29-31]. For the tumor-virus-immune system 
in this , the control input is the medicine containing the virus particles. Redundant or 
insufficient medicine dosages may well influence the therapeutic effect or patients’ 
health. Thus we consider the asymmetric input constraints and construct the cor- 
responding non-quadratic value functions associated with the tumor-virus-immune 
system. 

Recently, ADP-based methods have been proposed to develop approximate opti- 
mal strategies in various practical applications [32-35]. However, there exist sel- 
dom any literatures associated with optimal strategy based on virotherapy which is 


7.2 Problem Formulation and Preliminaries 117 


derived from ADP-based methods. Enlightened by the literatures mentioned above, 
we design the virotherapy-based optimal strategy via ADP method with MDRM. 
The contributions can be stated as follows. Firstly, the mathematic model is intro- 
duced to simulate the relationships between TCs, OVs and immune cells. Due to 
the asymmetric dosage constraints for medicine, a non-quadratic utility function is 
constructed to form the discounted value function. Then, on the basis of the tumor- 
virus-immune model, ADP method of single-critic architecture is proposed to solve 
HJBE such that the approximate optimal strategy can be achieved, which means that 
the TCs can be largely eliminated with the constrained optimal virotherapy-based 
strategy. Furthermore, the reasonable the medicine dosage regulation mechanism is 
firstly introduced into this algorithm framework, and the indications for medicine is 
considered for the first time. Finally, theoretical analysis and simulation experiments 
both validate the effectiveness of the designed therapeutic strategy. 


7.2 Problem Formulation and Preliminaries 


7.2.1 Establishment of Interaction Model 


In the section, tumor-virus-immune interaction model is introduced to describe the 
relations between TCs, viruses and immune cells. Due to the behavior of OVs, we can 
divide TCs into UTCs and ITCs. In the model composed of four ordinary differential 
equations as follows, Pry(t), Pr7(t), Py;(t) and P;y(t) respectively denote the 
populations of UTCs, ITCs, free OVs and immune cells. 

The population of UTCs can be affected by multiple factors, that is, the multi- 
plication and apoptosis of TCs, the infection by OVs and the reduction caused by 
immune cells. Moreover, the growth dynamics of UTCs is presented as 


Pru(t) =A Pro (1 OEO) Ay Pru Pritt 


— Bı Pru(t)Pim(t) — Ci Pru(t), (7.1) 


where A, is the tumor proliferation rate, A is the infection rate of virus, Bı denotes 
the killing-efficiency of immune cells, and C; is the apoptosis rate of UTCs. 
Similarly, the population of ITCs can be modeled by 


Pri (t) = Az Pru (t) Pyi(t) — B2Pri(t) Pru (t) — ePri(t), (7.2) 


where B2 denotes the immune killing-efficiency of ITCs and y is apoptosis rate of 
ITCs. 

The lysis of ITCs which contain multiple replicated virion particles and the input 
of virus agentia can both contribute to the rise of the free virus population. Thus the 
evolution dynamics of virus population can be presented as 
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Py (t) =U + KpPri(t) — Ar Pru (t) Pyi(t) 
— B3Py;(t) Pru (t) — CoPyr(t), (7.3) 


where U denotes the input of agentia, « the burst size of free viruses, B3 the immune 
killing-efficiency rate of OVs, and C3 the clearance rate of OVs. 
The immune response dynamics can be formulated as 


Pru (t) =D, Prr(t) Pru (t) + D2 Pru (t) Pim (t) 
— C3Piy(t), (7.4) 


where Dı and D, are immune response rates stimulated by infected and uninfected 
cells. And C3 is the apoptosis rate of immune cells. For purpose of simplifying the 
interaction model, we utilize the nondimensionalization technique [36, 37] to derive 
the simplified version as 


Pru(t) =a pru@)C — pru(t) — pri(t)) — cı pru (t) 
— apru (t)pvr(t) — bipru(@) pim(@) 

Pri(t) =a pru (t) pvit) — bopri(t)pim(t) — pprt) 

Pvit) =u + Kpri(t) — apru (t)pvr (t) (7.5) 
— b3pvi(t) prim (t) — c2 pvı(t) 

Pim) = pri) prm(t) + d prut) Pim) 
— c3 prm(t). 


Herein the nonnegative states of nondimensionalization version are represented 
as pru (t), pri(t), pvi(t) and prm (t). 


Remark 7.1 In virotherapy, the viruses achieved their reproductive objective by 
infecting tumor cells and replicating themselves. After the lysis of infected cells, 
new reproductions burst out and infect other tumor cells. Under this mechanism, the 
tumor cells can be effectively eliminated. Furthermore, comparing with uninfected 
tumor cells, the infected cells can activate immune cells more effectually to kill tumor 
cells. 


7.2.2 Problem Formulation 


Consider the system (7.5) as 


x= f(x)+ gu, (7.6) 
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where g = [0, 0, 1, 0]7, and f(x) is constructed by the right-hand side parts of (7.5) 
excluding the control input u. u € [um, um] where um and um denote the minimum 


and maximum thresholds for medicine input dosage. 
For system (7.6), the corresponding discounted value function is defined as 


[e6] 
V(x(t)) = f e WO W(x, w)dt, (7.7) 
t 
with the discounted factor 0 > 0. The utility function is given by 


Wx, u) =x x + xu), (7.8) 


where the matrix Y is positive definite, and x(u) is non-negative function. It’s noted 
that for system (7.6) the input constraints are not symmetric. In order to cope with 
this issue, function x(u) is defined as 


x(u)=2ħ / i YA a —a))de, (1.9) 


where œ = (Um + um)/2 and h = (um — tm)/2. WC) is a monotonic odd function 

which is continuously differential with ~(0) = 0. Without loss of generality, we 

select the hyperbolic tangent function as 7(-), that is, Y(-) = tanh(-). 
Differentiating the value function (7.7) along system (7.6), we obtain that 


O= VV (f + gu) +x? 7x + xu) — OV. (7.10) 
Then the Hamiltonian function can be expressed as 
H(x,u, VV) = VVT (f + gu) +x" Yx + x) — OV. (7.11) 
The optimal value function is defined as 
oe) 
V*(x) = min I e WO W(x, u)de. (7.12) 
u t 
which satisfies HJBE 
min H(x,u, VV*) = 0. (7.13) 
Applying the stationary condition, we can derive the optimal strategy as 
* 1 T * 
uw = —htanh(>_9 VV")+a. (7.14) 


On the basis of (7.13) and (7.14), we rewrite the HJBE as 
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1 
(VV*) f- hV V*)” g tanh(5— 9" VV") +x'Yx 
+(VV*)' ga — 0V* + y(u*) = 0. (7.15) 


Remark 7.2 In the conventional optimal control issue with control constraints, it’s 
often required that the input constraints should be symmetric. Nevertheless, the pro- 
posed method in this takes the asymmetric input constraints into account. Thus the 
symmetric constrained condition is relaxed by constructing the unconventional utility 
function (7.8). 


Due to the nonlinear nature of (7.15), it’s often intractable to derive the analytical 
solution, which is requisite for designing the optimal strategy. To overcome this 
issue, in the following sections, ADP method of single-critic network using dosage 
regulation mechanism is designed to approximately solve (7.15). 


7.3 Optimal Strategy Based on MDRM 


In order to achieve the goal of regulating therapeutic strategy timely and necessarily, 
MDRM is introduced to provide indications for medicine to determine the time when 
it’s necessary to make some regulation. Therefore, the time sequence {z,} is required 
to record the regulating instants. The parameter 1 € N* represents the 1th updating 
instant and N* is the set including all positive integers. Then we can define the state 
as 


XO) = x (z), t € La, Z41)- (7.16) 


In general, the clinical data after the latest regulation is different from the current 
comparable data. Hence the error is given by 


v, (t) = x, — x(t), t € [zi B41). (7.17) 


Based on v, and the threshold associated with state x, the medicine regula- 
tion mechanism is established. When a regulation occurs, v, = 0, which means 
the medicine dosage is regulated to be equal to the current medicine indication. 
The comparable data is updated by the clinical data at regulation instant, and the 
medicine dosage remains unchanged until the occurrence of the next regulation. 
That is, ù = w(x,). Thus we derive the MDRM-based strategy as 


1 
ù* = —htanh(5— 9" &)VV"(&)) +a, (7.18) 


where VV* = OV* /Ox when t = z,. Then the medicine regulation mechanism- 
based HJBE can be denoted as 
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1 
H(x, i", V*) = — RCV V*)" g tanh(=—9" V V*Č)) 
+(VV*) f 4(VV*) gat x7 Vx 
+ y(a*) — OV*. (7.19) 
The existence of the error v, lead to that (7.19) does equal to 0, which is different 
from HJBE (7.15). Before proceeding, an assumption is necessary [31]. 


Assumption 7.1 The optimal strategy u* is locally Lipschitz with respect to error 
V, ie., u* — a*||? < Ky ||x — X,||? = Kullu, ||? where K, is a positive constant. 


Theorem 7.1 Consider the nonlinear system (7.6). Suppose that Assumption 7.1 
is tenable and there exists function V* satisfying (7.15). If the optimal strategy is 
formulated as (7.18) with the medicine indication 


ad ~ Cyn (Y) 


Ixl? (7.20) 


2 
lvlt < 


where ¢ € (0, 1) is the designed parameter, then the controlled system is guaranteed 
to be asymptotically stable in the sense of UUB. 


Proof Select the Lyapunov function Y = V*(x). Then we can obtain the derivative 
of V* 


Y = (VVT (S + gù”). (7.21) 
According to (7.14) and (7.15), we derive that 
(VVS f = —(VV*) gu* — x7 Vx — y(u*) + OV", (7.22) 
and 
(VV*)? g = —2A(tanh7!((u* — a)/h))’. (7.23) 
Then (7.21) can be rewritten as 


ř=- VY ee — ù*) — xT Yx — x(u*) + 0V* 
= — 2ħ(tanh™! ((u* — a)/h))? (ùŭ* — u*) — x7 Yx 
— x(u*) + 0V* 
=—x'Yx— x(u*) + 0V* +0, (1.24) 


where w = —2ħ(tanh™! ((u* — a)/h))" (ù* — u*). Due to Young’s inequality, from 
(7.24) we derive 


w < h (tanh7! ((u* — a)/A))? + Kall I- (7.25) 
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Via variable substitution approach, we have 


x(u*) = 2h / _ tanh“! (U — a)/h)d(e — a). (7.26) 
0 


The function (7.26) can be further expressed as 


tanh! ((u*—a)/h) 
(ut) <2? f (1 — tanh?(s))ds 
0 


tanh™! ((a*—a)/h) 
— ae f ç tanh? (ç)dç 
+ E T —a)/h))’. (7.27) 
Based on (7.24), (7.25) and (7.27), we can obtain 
¥ <2) + Kalin? +0v*—x' rx, (7.28) 


=l Mis . “14° . . 
where £1 (x) = 2h? po ee) ¢tanh?(s)ds. Via utilizing integral mean-value 


theorem, we derive that 
E(x) = 2h? tanh! ((u* — a)/h)p tanh? (p), (7.29) 


where p € (0, tanh~!((a* — a)/h)). As u* is admissible, it can be deduced that V* 
and VV* are bounded. Let ||V*|| < by and ||VV*|| < byy with by and byy being 
positive constants. Then (7.29) becomes that 


E(x) <2h? tanh! ((u* — a)/h)p 
<2h? (tanh! ((u* — a)/h))" 
1 
1 
=5b;bvv £ bz, (7.30) 


where the positive constant b, denotes the bound of g(x). According to (7.28) and 
(7.30), it can be obtained that 


¥<- An ONE — A = A An OONN 
+ Kullu ||? + Obv + ba,. (7.31) 


When the indication (7.20) is satisfied, it yields that ř < —C An Olx + 
Oby +b, = 


Oby + bz,. Then we can conclude that Y <0 when |x|] > Zur)" 
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Theorem 7.1 indicates that with the utilization of medicine regulation mecha- 
nism, the MDRM-based optimal strategy can asymptotically stabilize the controlled 
system. 


7.4 MDRM-Based Approximate Optimal Control Design 


The approximate optimal control strategy is designed based on the ADP algorithm 
which integrates the medicine regulation mechanism. Furthermore, for the closed- 
loop controlled system, the asymptotically stability in the sense of UUB is guaranteed 
when the proposed medicine indication is applied. 


7.4.1 Implementation of the Adaptive Dynamic Programming 
Method 


In this section, the approximate optimal strategy is designed by the ADP method of 
single-critic framework which integrates the medicine regulation mechanism. 

Based on the universal approximation properties of NN, V* can be represented 
as 


V* =u V(x) +7, (7.32) 


where w* is the ideal weight vector, 17(-) the activation function and 7 the approximate 
error. Let I (X,) = 49" (,) VU" (X,)w, then we have 


a* = —Atanh(I\(%,)) + T%,) +a, t € [z 41) (7.33) 
where 7(x,) = —(1/2)(1 — tanh?(®(%,)))g7 (¥,)V7(%,). Herein, ©(%,) is selected 
between 1/(2h)g7 (x,)VV*(x,) and Tı (X,). As the ideal weight w* is unknown, the 
approximate version of V* is derived by the critic NN, which is presented as 


V =0'0(x), (7.34) 


where is the approximate vector. Then the MDRM-based approximate strategy 
can be obtained 


a = —ħ tanh (D (X,)) + a, t € [z,, 241), (7.35) 


where I>(X,) = 1/(2A)g7 (,) V0" (x,). Then the approximate Hamiltonian could 
be restated as 
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H(x, ù, 0) = O7E+x7Vx4+x@ Zen, (7.36) 
where € = VU(f + gu) — OV. 


The goal of tuning & is to minimize the term £y. Thus we set the target function 
as E = ele H. Using the gradient descent approach, we obtain 


£ $ EH = 
(ETE+ 1)? 


o= 


lEen, (7.37) 


where £ is the learning parameter and č = €/(€7E4+ 1)°. Define & = w* — &. From 
(7.37) we derive that 


© = LEET O + LÈey, (7.38) 


where € = €/(€7 € + 1) and the approximate residual error ey = —Vt! (f + git) + 
OT. Before presenting the main results, the following assumptions are requisite [38, 
39]. 


Assumption 7.2 The signal £ is persistently excited over the time interval [ż, t + T]. 
In another word, there exists the positive constants @ and T such that 


t+T E 
bly.xn. < / Ef" du, (7.39) 
t 


with N. being the neuron number of the critic network. 


Assumption 7.3 The terms 7 and ey are both bounded. That is, ||7|| < b+ and 
ey || < bey where bz and bey are positive constants. 


7.4.2 Stability Analysis 


This section discuss the asymptotic stability of the controlled system with the 
designed DARM-based strategy. 


Theorem 7.2 Consider system (7.6) and let Assumptions 7.1-7.3 hold. The strategy 
is given by (7.35) and the weights tuning law for critic is set as (7.37). Then the 
closed-loop system (7.6) and weight estimation error © are asymptotically stable in 
the sense of UUB provided that the medicine indication is applied 


1 = PAn A 
Can E I 


2 
lvlt < 
2K,„ 


T, | (7.40) 


with ņ € (0, 1) being the regulation parameter. 
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Proof Select the Lyapunov function as 
Y = V*(%,) + V* (x) + 00l = Y, + Yp + Y.. (7.41) 
Note that when medicine indication is applied, the system can be described by the 
impulsive model comprising two components. One is flow dynamics fort € [z;, Zı+1) 
and the other is jump dynamics for t = z,. Hence we present the discussions over 
the two cases. 


Case I: No regulation occurs, i.e., t € [Z,, Z,41). Then we can obtain Ý, = Q. In 
light of (7.22) and (7.23), we could derive that 


Ý, =(VV*)"(f + git) 
= — x(u*) — xT Yx + 0V*, (1.42) 


where 5) = —2ħ(tanh™! ((u* — a)/h))’ (ù — u*). According to Young’s inequa- 
tion, we have 


Ey < h?|| tanh™!((u* — a)/h) ||? + ù — a*|I?. (7.43) 
Recalling (7.27), we obtain 
E — x(u“) < Ey (x) + lŭ — u* |’. (7.44) 
As Æ (x) and V*(x) are bounded, (7.42) becomes 
Ý, < |ù — u* ||? + bs, + Oby — x7 Yx. (7.45) 
Applying the Young’s inequation, we derive that 


ù — u*l =||ŭ — ŭ* + ŭ* — u* ||? 22a — ŭ* |? + 2lŭ* = u* ||? 
<4||ñ tanh (T (,)) — A tanh (D ŠD? + HFI? + 2K A? 
<8? tanh? (I) (%,)) + 8h? tanh? (M (X,)) + 2Ku||4 ||? +4b2. (7.46) 


As | tanh(-)| < 1, it could be obtained that 
Pp < An Olx? + 2K lal? + o, (7.47) 


where o = 16h? + Ab2 + Oby + bs.. 
Taking the derivative of Y., we derive that 


Ý, = —207 EET O + 207 Een. (7.48) 


In light of Young’s inequation, it yields that 
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20T Een < 200" ey < OEE + elep. (7.49) 
Then (7.48) can be further expressed as 
Ý. < -OT EET O + een < —Am (MAI? + Bey, (7.50) 
where ô = €€7. 
According to (7.47) and (7.50), when the medicine indication (7.40) is satisfied, 


we can derive that 


Ý <-(1l—9)Am(Y)IKI? = An (YI? + 2K lls? 
= Am (5) |||? F by + oO 
<= Amn OONAN? — Am (OZ? + by + o. (7.51) 


Then it can be concluded that Y < 0 when one of the conditions holds that 


1 [by +o 
xl > —,/— (7.52) 
Mn A AnD) 
and 
` by +o 
£ ; 7.53 
loll > nO (7.53) 


Thus x and w are demonstrated to be UUB. 
Case IT: A regulation occurs, i.e., t = z,. The difference of Ly is presented as 


AY = V* (X41) — V* Š) + VEZ) — V* (œz) 
e L l 
AYa AY, 
= ONION — O ; (7.54) 


—_—_—_ ama 
AY, 


From the analysis in Case I, it can be derived that Ly < 0 when (7.52) or (7.53) 
is satisfied. It can be further deduced that Y, + Y, is monotonically decreasing when 
t €[z,, 2,41), that is, 

Yp(x(Z1)) + Vex (Zi) = Yor +0) + Va +), (7.55) 


where e € (0, z,41 — z,). According to the properties of limits, we can obtain 


Yu (x(z1)) + Ye(x(z.)) = YEE) + Ye), (7.56) 
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with x(z+) = lim.+o x(z, + ©). More specially, it yields that 


1. M 1. x 
V* (x (z:)) + pe" eG) > V¥ (zt) + gË DORP). (7.57) 
As x is proved to be UUB, it can be obtained that 
V* 41) < V“). (7.58) 


From (7.57) and (7.58), it’s derived that AY < 0, which indicates that the con- 
structed Lyapunov (7.41) is monotonically decreasing when t = z,. a 


Remark 7.3 Ç in (7.40) is the regulation parameter determining the frequency of 
medicine dosage regulation. A large ¢ means that the medicine dosage is regulated 
frequently while a small ¢ implies the regulation occurs rarely. It can be set as an 
appropriate value according to the clinical data. 


Remark 7.4 Theorem 7.2 indicates that the designed MDRM-based approximate 
optimal strategy (7.35) can asymptotically stabilize system (7.6). The medicine indi- 
cation (7.40), the cornerstone of MDRM, can provide a reasonable reference thresh- 
old for therapeutic strategy. When the difference derived from the current clinical 
data and latest reference data is larger than the threshold, the medicine dosage can 
be regulated, and the current indication data will be recorded and utilized as the new 
reference data in the future. Thus the designed therapeutic strategy can be regulated 
timely and necessarily according to the medicine indication. 


Remark 7.5 The discount factor is programmed to avoid infinity and infinitesimal 
value function in the accumulation of rewards, and immediately return can earn more 
than the delayed return of interest. In human trials, we have found that human prefer 
to immediately return can present close to exponential growth, the discount factor is 
used to simulate such a cognitive model and biological process to make a decision. 


7.5 Simulation Study 


In this section, we consider the system (7.6) which is the simplified version of the 
growth dynamics of cells and viruses described by (7.1)-(7.4). Based on system (7.6), 
the simulation experiment is conducted to show the effectiveness of the proposed 
ADP method with medicine regulation mechanism. 

According to the clinical medical statistics and literatures [36, 37, 40], the param- 
eters associated with the dynamics (7.1)-(7.4) are presented in Table 7.1. After the 
nondimensionalization, the corresponding parameters are set as aj = 0.36, a2 = 0.1, 
bı = 0.36, b2 = 0.48, b3 = 0.16, cy = 0.1278, c2 = 0.2, c3 = 0.036, dı = 0.6, and 
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Table 7.1 Parameter specifications of the tumor-virus-immune system 


Parameters Descriptions Values 

Al Tumor proliferation rate 2x 1072h! 

A2 Infection rate of virus 7 x 107! mm3/h 
Bı Killing-efficiency of immune cells 2 x 1078 mm3/h 
B2 Immune killing-efficiency of infected tumor cells 2 x 1078 mm? /h 
Cı Apoptosis rate of uninfected tumor cells 0.0071 h7! 

C2 Clearance rate of viruses 0.0119h7! 

C3 Apoptosis rate of immune cells 0.002 h7! 

Dı Immune response rate stimulated by infected cells 5.6 x 1077 mm? /h 
D2 Immune response rate stimulated by uninfected cells | 5.6 x 1077 mm?/h 
p Apoptosis rate of infected tumor cells 0.056h7! 

K Burst size of free virus 9.0 


The population of uninfected tumor cells 
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Time(day) 


Fig. 7.1 The population evolution of uninfected tumor cells 


300 


də = 0.29. The initial state vector is [0.8, 0, 0.2, 0.05]’. The minimum and max- 
imum thresholds are given by um = 0 and wy = 0.02. For the discounted value 
function (7.7) of system (7.6), the parameters Y = 0.214x4 and 0 = 0.5. 

For the critic network, we select the activation function as [x?, X1X2, X1X3, X1X4, 
oe X2X3, X2X4, x$ , X3 X4, rd ae The other parameters are respectively set as K,, = 20, 
¢ = 0.9 and £ = 1.6. 

Simulation results demonstrate that in Figs.7.1, 7.2, 7.3, 7.4, 7.5, 7.6 and 7.7. 
For model (7.5), the evolution trajectories of states are respectively depicted in 
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Fig. 7.2 The population evolution of infected tumor cells 
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Fig. 7.3 The population evolution of free oncolytic virus 
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Figs. 7.1, 7.2, 7.3 and 7.4. From Fig.7.1, we could observe that under the attacks 
from oncolytic viruses and immune cells, the population of uninfected tumor cells 
rapidly declines and reaches a stabilizing value which is very low after t = 150d. 
Figures 7.2 and 7.3 reveal the relations between the population of infected tumor cells 
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The population of immune cells 
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Fig. 7.4 The population evolution of immune cells 


The medicine dosages of therapy strategies 
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Fig. 7.5 The curves of the therapeutic strategies 
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Fig. 7.6 The population evolutions of uninfected tumor cells 
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Fig. 7.7 The population evolutions of infected tumor cells 
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and that of virus particles which is large proportional. The immune cells are activated 
by the uninfected and infected tumor cells to kill tumor cells, which can be observed 
from Fig.7.4. The medicine dosage of the derived approximate optimal therapeutic 
strategy and that of initial strategy are compared in Fig. 7.5. From Fig. 7.5, one can 
derive that the dosage of the obtained strategy is obviously less than that of initial 
strategy. On the other hand, the input dosages of the two strategies are both con- 
strained by the pre-designed thresholds. This is of great practical significance since 
excess medicines may well threaten the health of patients and cause a huge over- 
head. Furthermore, it can be observed that the medicine dosage regulation frequency 
steps down when the clinic data becomes better, which means that with the aid of 
medicine regulation mechanism, the medicine dosage can be regulated timely and 
necessarily. Figures 7.6 and 7.7 present the population curves of the cells and viruses 
under the derived strategy with different burst sizes of viruses, that is, k = 2, 5. 
This verified that the obtained therapeutic strategy can effectively kill tumor cells 
with oncolytic viruses of different burst out sizes. However, when the parameter « 
is large enough, it may cause an oscillation. When the innate immune response is 
considered, the tumor-virus-immune system becomes very complicated. Though the 
viruses with large « try their best to produce more replicas and infect more tumor 
cells, the reduction of tumor cells inactivate the immune response in the meanwhile. 
The viruses dominate the dynamics and the warfare between tumor cells and viruses 
can last a long time such that the oscillation occurs repeatedly. The oncolytic virus 
has the ability to effectively kill the tumor cells, while the immune response can 
reduce the killing-efficiency of the viruses and block their infections. Furthermore, 
the activated immune response can eliminate tumor cells as well. Thus there exists 
a subtle balance between the viruses and the immune cells which demands a further 
investigation. 


7.6 Conclusion 


Medicine regulation mechanism has been designed such that the constrained thera- 
peutic strategy based on virotherapy can be obtained to eliminate tumor cells, guar- 
anteeing that the medicine dosage can be regulated timely and necessarily. Firstly, a 
mathematical model is utilized to describe the relations among the uninfected tumor 
cells, infected tumor cells, oncolytic viruses and immune cells. Meanwhile, as the 
simplified version of the tumor-virus-immune model, the non-quadratic function is 
proposed to formulate the value function to acquire HJBE. Secondly, to address 
the optimal therapeutic strategy, single-critic architecture has been designed to seek 
the approximate solution of the HJBE through ADP. Finally, the simulation results 
has verified the effectiveness of the proposed method. Furthermore, nonzero-sum 
optimal control based on differential games will be a edge of the new frontier in 
therapy of tumor treatment, cardiovascular, orthodontic treatment, osteoporosis and 
cerebrovascular diseases. 
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